
Parallel Computing

Instructor: Nurbek Saparkhojayev
Lecture #1: Introduction to Parallel Computing

Page 2: Paralel Computing

Lecture#1 outline

Background

Why use parallel computing?

Who and What?

Concepts and Terminology

Parallel Computer Memory Architectures

Page 3: Paralel Computing

Background

Traditionally, software has been written for serial computation:

* To be run on a single computer having a single Central Processing Unit (CPU);
* A problem is broken into a discrete series of instructions.
* Instructions are executed one after another.
* Only one instruction may execute at any moment in time.

Page 4: Paralel Computing

Cont.

In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:

* To be run using multiple CPUs
* A problem is broken into discrete parts that can be solved concurrently
* Each part is further broken down to a series of instructions
* Instructions from each part execute simultaneously on different CPUs

Page 5: Paralel Computing

Parallel Computing

The compute resources can include:

* A single computer with multiple processors;
* An arbitrary number of computers connected by a network;
* A combination of both.

The computational problem usually demonstrates the ability to:

* Be broken apart into discrete pieces of work that can be solved simultaneously;
* Execute multiple program instructions at any moment in time;
* Be solved in less time with multiple compute resources than with a single compute resource.

Page 6: Paralel Computing

The Universe is Parallel:

Parallel computing is an evolution of serial computing that attempts to emulate what has always been the state of affairs in the natural world: many complex, interrelated events happening at the same time, yet within a sequence. For example:

* Galaxy formation
* Planetary movement
* Weather and ocean patterns
* Tectonic plate drift
* Rush hour traffic
* Automobile assembly lines
* Building a space shuttle
* Ordering a hamburger at the drive-through.

Page 7: Paralel Computing

The Real World is Massively Parallel

Page 8: Paralel Computing

Uses for Parallel Computing:

Historically, parallel computing has been considered "the high end of computing", and has been used to model difficult scientific and engineering problems found in the real world. Some examples:

* Atmosphere, Earth, Environment
* Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
* Bioscience, Biotechnology, Genetics
* Chemistry, Molecular Sciences
* Geology, Seismology
* Mechanical Engineering - from prosthetics to spacecraft
* Electrical Engineering, Circuit Design, Microelectronics
* Computer Science, Mathematics


Page 10: Paralel Computing

Different applications

Today, commercial applications provide an equal or greater driving force in the development of faster computers. These applications require the processing of large amounts of data in sophisticated ways. For example:

* Databases, data mining
* Oil exploration
* Web search engines, web-based business services
* Medical imaging and diagnosis
* Pharmaceutical design
* Management of national and multi-national corporations
* Financial and economic modeling
* Advanced graphics and virtual reality, particularly in the entertainment industry
* Networked video and multi-media technologies
* Collaborative work environments


Page 12: Paralel Computing

Why use Parallel Computing?

Main Reasons:

a. Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings. Parallel clusters can be built from cheap, commodity components.

Page 13: Paralel Computing

cont.

b. Solve larger problems: Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer, especially given limited computer memory. For example:

* "Grand Challenge" problems (en.wikipedia.org/wiki/Grand_Challenge) requiring PetaFLOPS and PetaBytes of computing resources.
* Web search engines/databases processing millions of transactions per second

Page 14: Paralel Computing

cont.

c. Provide concurrency: A single compute resource can only do one thing at a time. Multiple computing resources can be doing many things simultaneously. For example, the Access Grid (www.accessgrid.org) provides a global collaboration network where people from around the world can meet and conduct work "virtually".

Page 15: Paralel Computing

cont.

d. Use of non-local resources: Using compute resources on a wide area network, or even the Internet, when local compute resources are scarce. For example:

* SETI@home (setiathome.berkeley.edu) uses over 330,000 computers for a compute power of over 528 TeraFLOPS (as of August 04, 2008)
* Folding@home (folding.stanford.edu) uses over 340,000 computers for a compute power of 4.2 PetaFLOPS (as of November 4, 2008)

Page 16: Paralel Computing

cont.

e. Limits to serial computing: Both physical and practical reasons pose significant constraints to simply building ever faster serial computers:

* Transmission speeds - the speed of a serial computer is directly dependent upon how fast data can move through hardware. Absolute limits are the speed of light (30 cm/nanosecond) and the transmission limit of copper wire (9 cm/nanosecond). Increasing speeds necessitate increasing proximity of processing elements.
* Limits to miniaturization - processor technology is allowing an increasing number of transistors to be placed on a chip. However, even with molecular or atomic-level components, a limit will be reached on how small components can be.
* Economic limitations - it is increasingly expensive to make a single processor faster. Using a larger number of moderately fast commodity processors to achieve the same (or better) performance is less expensive.

The bottom line: current computer architectures are increasingly relying upon hardware-level parallelism to improve performance:

* Multiple execution units
* Pipelined instructions
* Multi-core

Page 17: Paralel Computing

Who and What?

Top500.org provides statistics on parallel computing users - the charts below are just a sample. Some things to note:

* Sectors may overlap - for example, research may be classified research. Respondents have to choose between the two.

* "Not Specified" is by far the largest application - probably means multiple applications.

Page 18: Paralel Computing

Who's doing Parallel Computing?

Page 19: Paralel Computing

Future

The Future: During the past 20 years, the trends indicated by ever faster networks, distributed systems, and multi-processor computer architectures (even at the desktop level) clearly show that parallelism is the future of computing.

Page 20: Paralel Computing

Concepts and Terminology
von Neumann Architecture

* Named after the Hungarian mathematician John von Neumann, who first authored the general requirements for an electronic computer in his 1945 papers.
* Since then, virtually all computers have followed this basic design, which differed from earlier computers programmed through "hard wiring".

Four main components:

1. Memory
2. Control Unit
3. Arithmetic Logic Unit
4. Input/Output

* Read/write, random access memory is used to store both program instructions and data.
  o Program instructions are coded data which tell the computer to do something.
  o Data is simply information to be used by the program.
* The Control Unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task.
* The Arithmetic Logic Unit performs basic arithmetic operations.
* Input/Output is the interface to the human operator.

Page 21: Paralel Computing

Von Neumann architecture

Page 22: Paralel Computing

Flynn's Classical Taxonomy

There are different ways to classify parallel computers. One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy.

Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction and Data. Each of these dimensions can have only one of two possible states: Single or Multiple.

There are 4 possible classifications according to Flynn:

* SISD - Single Instruction, Single Data
* SIMD - Single Instruction, Multiple Data
* MISD - Multiple Instruction, Single Data
* MIMD - Multiple Instruction, Multiple Data

Page 23: Paralel Computing

Flynn's Classical Taxonomy-SISD

Single Instruction, Single Data (SISD):

* A serial (non-parallel) computer
* Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
* Single data: only one data stream is being used as input during any one clock cycle
* Deterministic execution
* This is the oldest and, even today, the most common type of computer
* Examples: older generation mainframes, minicomputers and workstations; most modern day PCs.

Page 24: Paralel Computing

SISD

Page 25: Paralel Computing

SIMD

Single Instruction, Multiple Data (SIMD):

* A type of parallel computer
* Single instruction: All processing units execute the same instruction at any given clock cycle
* Multiple data: Each processing unit can operate on a different data element
* Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing.
* Synchronous (lockstep) and deterministic execution
* Two varieties: Processor Arrays and Vector Pipelines
* Examples:
  o Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
  o Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10
* Most modern computers, particularly those with graphics processing units (GPUs), employ SIMD instructions and execution units.

Page 26: Paralel Computing

SIMD

Page 27: Paralel Computing

SIMD

Page 28: Paralel Computing

Multiple Instruction, Single Data (MISD):

* A single data stream is fed into multiple processing units.
* Each processing unit operates on the data independently via independent instruction streams.
* Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).
* Some conceivable uses might be:
  o multiple frequency filters operating on a single signal stream
  o multiple cryptography algorithms attempting to crack a single coded message.

Page 29: Paralel Computing

MISD

Page 30: Paralel Computing

Multiple Instruction, Multiple Data (MIMD)

* Currently, the most common type of parallel computer. Most modern computers fall into this category.
* Multiple Instruction: every processor may be executing a different instruction stream
* Multiple Data: every processor may be working with a different data stream
* Execution can be synchronous or asynchronous, deterministic or non-deterministic
* Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs.
* Note: many MIMD architectures also include SIMD execution sub-components

Page 31: Paralel Computing

MIMD

Page 32: Paralel Computing

Some General Parallel Terminology

Task - A logically discrete section of computational work. A task is typically a program or program-like set of instructions that is executed by a processor.

Parallel Task - A task that can be executed by multiple processors safely (yields correct results).

Serial Execution - Execution of a program sequentially, one statement at a time. In the simplest sense, this is what happens on a one-processor machine. However, virtually all parallel programs have sections that must be executed serially.

Parallel Execution - Execution of a program by more than one task, with each task being able to execute the same or a different statement at the same moment in time.

Pipelining - Breaking a task into steps performed by different processor units, with inputs streaming through, much like an assembly line; a type of parallel computing.

Shared Memory - From a strictly hardware point of view, describes a computer architecture where all processors have direct (usually bus-based) access to common physical memory. In a programming sense, it describes a model where parallel tasks all have the same "picture" of memory and can directly address and access the same logical memory locations regardless of where the physical memory actually exists.

Page 33: Paralel Computing

Terminology

Symmetric Multi-Processor (SMP) - Hardware architecture where multiple processors share a single address space and access to all resources; shared memory computing.

Distributed Memory - In hardware, refers to network-based memory access for physical memory that is not common. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing.

Communications - Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network; however, the actual event of data exchange is commonly referred to as communications regardless of the method employed.

Synchronization - The coordination of parallel tasks in real time, very often associated with communications. Often implemented by establishing a synchronization point within an application where a task may not proceed further until another task(s) reaches the same or a logically equivalent point. Synchronization usually involves waiting by at least one task, and can therefore cause a parallel application's wall clock execution time to increase.

Page 34: Paralel Computing

Terminology

Granularity - In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.

* Coarse: relatively large amounts of computational work are done between communication events
* Fine: relatively small amounts of computational work are done between communication events

Observed Speedup - Observed speedup of a code which has been parallelized, defined as:

    speedup = wall-clock time of serial execution / wall-clock time of parallel execution

One of the simplest and most widely used indicators of a parallel program's performance.
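As a small illustration (not part of the original slides; the helper function and the timing numbers below are made up), speedup and efficiency can be computed directly from measured wall-clock times:

#include <stdio.h>

/* Observed speedup and parallel efficiency from measured wall-clock times.
 * t_serial  : wall-clock time of the serial run (seconds)
 * t_parallel: wall-clock time of the parallel run (seconds)
 * nprocs    : number of processors used in the parallel run               */
static void report_speedup(double t_serial, double t_parallel, int nprocs)
{
    double speedup    = t_serial / t_parallel;
    double efficiency = speedup / nprocs;      /* 1.0 would be ideal scaling */
    printf("speedup = %.2f, efficiency = %.2f\n", speedup, efficiency);
}

int main(void)
{
    /* Example numbers only: 120 s serial vs. 20 s on 8 processors. */
    report_speedup(120.0, 20.0, 8);
    return 0;
}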

Page 35: Paralel Computing

Terminology

Parallel Overhead - The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel overhead can include factors such as:

* Task start-up time
* Synchronizations
* Data communications
* Software overhead imposed by parallel compilers, libraries, tools, operating system, etc.
* Task termination time

Massively Parallel - Refers to the hardware that comprises a given parallel system - having many processors. The meaning of "many" keeps increasing, but currently the largest parallel computers can be comprised of processors numbering in the hundreds of thousands.

Embarrassingly Parallel - Solving many similar, but independent, tasks simultaneously; little to no need for coordination between the tasks.

Scalability - Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more processors. Factors that contribute to scalability include:

* Hardware - particularly memory-CPU bandwidths and network communications
* Application algorithm
* Parallel overhead
* Characteristics of your specific application and coding

Multi-core Processors - Multiple processors (cores) on a single chip.

Cluster Computing - Use of a combination of commodity units (processors, networks or SMPs) to build a parallel system.

Supercomputing / High Performance Computing - Use of the world's fastest, largest machines to solve large problems.

Page 36: Paralel Computing

Parallel Computer Memory Architectures

a. Shared Memory - General Characteristics:

* Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as global address space.
* Multiple processors can operate independently but share the same memory resources.
* Changes in a memory location effected by one processor are visible to all other processors.
* Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.

Page 37: Paralel Computing

Uniform Memory Access (UMA):

* Most commonly represented today by Symmetric Multiprocessor (SMP) machines
* Identical processors
* Equal access and access times to memory
* Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.

Page 38: Paralel Computing

UMA

Page 39: Paralel Computing

Non-Uniform Memory Access (NUMA):

* Often made by physically linking two or more SMPs
* One SMP can directly access memory of another SMP
* Not all processors have equal access time to all memories
* Memory access across the link is slower
* If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent NUMA

Page 40: Paralel Computing

NUMA

Page 41: Paralel Computing

Advantages & Disadvantages

Advantages:

* Global address space provides a user-friendly programming perspective to memory
* Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs

Disadvantages:

* Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory-CPU path, and for cache coherent systems, geometrically increase traffic associated with cache/memory management.
* Programmer responsibility for synchronization constructs that ensure "correct" access of global memory.
* Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.

Page 42: Paralel Computing

Distributed Memory

General Characteristics:

Like shared memory systems, distributed memory systems vary widely but share a common characteristic: distributed memory systems require a communication network to connect inter-processor memory.

Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of global address space across all processors.

Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply.

When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.

The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.

Page 43: Paralel Computing

Distributed Memory

Page 44: Paralel Computing

Distributed Memory

Advantages:

* Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately.
* Each processor can rapidly access its own memory without interference and without the overhead incurred in trying to maintain cache coherency.
* Cost effectiveness: can use commodity, off-the-shelf processors and networking.

Disadvantages:

* The programmer is responsible for many of the details associated with data communication between processors.
* It may be difficult to map existing data structures, based on global memory, to this memory organization.
* Non-uniform memory access (NUMA) times

Page 45: Paralel Computing

Hybrid Distributed-Shared Memory

The largest and fastest computers in the world today employ both shared and distributed memory architectures.

The shared memory component is usually a cache coherent SMP machine. Processors on a given SMP can address that machine's memory as global.

The distributed memory component is the networking of multiple SMPs. SMPs know only about their own memory - not the memory on another SMP. Therefore, network communications are required to move data from one SMP to another.

Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future.

Advantages and Disadvantages: whatever is common to both shared and distributed memory architectures.

Page 46: Paralel Computing

The end of the first lecture!!

QUESTIONS? Comments? Requests?

Page 47: Paralel Computing

Parallel Computing

Instructor: Nurbek Saparkhojayev
Lecture #2: Parallel Programming Models

Page 48: Paralel Computing

Models

There are several parallel programming models in common use:

o Shared Memory
o Threads
o Message Passing
o Data Parallel
o Hybrid

Parallel programming models exist as an abstraction above hardware and memory architectures.

Although it might not seem apparent, these models are NOT specific to a particular type of machine or memory architecture. In fact, any of these models can (theoretically) be implemented on any underlying hardware. Two examples:

Page 49: Paralel Computing

1st Model

1. Shared memory model on a distributed memory machine: Kendall Square Research (KSR) ALLCACHE approach.

Machine memory was physically distributed, but appeared to the user as a single shared memory (global address space). Generically, this approach is referred to as "virtual shared memory". Note: although KSR is no longer in business, there is no reason to suggest that a similar implementation will not be made available by another vendor in the future.

Page 50: Paralel Computing

2nd Model

2. Message passing model on a shared memory machine: MPI on SGI Origin.

The SGI Origin employed the CC-NUMA type of shared memory architecture, where every task has direct access to global memory. However,

the ability to send and receive messages with MPI, as is commonly done over a network of distributed memory machines, is not only implemented but is very

commonly used.

* Which model to use is often a combination of what is available and personal choice. There is no "best" model, although there certainly are better

implementations of some models over others.

* The following sections describe each of the models mentioned above, and also discuss some of their actual implementations.

Page 51: Paralel Computing

Shared Memory Model(detailed)

In the shared-memory programming model, tasks share a common address space, which they read and write asynchronously.

Various mechanisms such as locks / semaphores may be used to control access to the shared memory.

An advantage of this model from the programmer's point of view is that the notion of data "ownership" is lacking, so there is no need to specify explicitly the

communication of data between tasks. Program development can often be simplified.

An important disadvantage in terms of performance is that it becomes more difficult to understand and manage data locality.

Keeping data local to the processor that works on it conserves memory accesses, cache refreshes and bus traffic that occurs when multiple processors

use the same data. Unfortunately, controlling data locality is hard to understand and beyond

the control of the average user.

Page 52: Paralel Computing

Shared Memory Model(detailed)

Implementations:

On shared memory platforms, the native compilers translate user program variables into actual memory addresses, which are global.

No common distributed memory platform implementations currently exist. However, as mentioned previously in the Overview section, the KSR

ALLCACHE approach provided a shared memory view of data even though the physical memory of the machine was distributed.

Page 53: Paralel Computing

Threads Model

In the threads model of parallel programming, a single process can have multiple, concurrent execution paths.

Perhaps the simplest analogy that can be used to describe threads is the concept of a single program that includes a number of subroutines:

Page 54: Paralel Computing

Threads Model

Page 55: Paralel Computing

Threads Model(Code)

The main program a.out is scheduled to run by the native operating system. a.out loads and acquires all of the necessary system and user resources to run.

a.out performs some serial work, and then creates a number of tasks (threads) that can be scheduled and run by the operating system concurrently.

Each thread has local data, but also shares all of the resources of a.out. This saves the overhead associated with replicating a program's resources for each thread. Each thread also benefits from a global memory view because it shares the memory space of a.out.

A thread's work may best be described as a subroutine within the main program. Any thread can execute any subroutine at the same time as other

threads.
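Below is a minimal, hypothetical sketch of this model using POSIX Threads (introduced a few slides later); the worker routine, the thread count and the shared counter are illustrative only. Compile with something like gcc -pthread.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                       /* shared data in a.out's memory */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread runs this "subroutine" concurrently with the others. */
static void *worker(void *arg)
{
    long id = (long)arg;                       /* thread-local data */
    pthread_mutex_lock(&lock);                 /* synchronize access to shared memory */
    counter += id;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)                                 /* the "a.out" main program */
{
    pthread_t threads[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)        /* create threads after some serial work */
        pthread_create(&threads[i], NULL, worker, (void *)i);

    for (int i = 0; i < NTHREADS; i++)         /* wait for all threads, then continue serially */
        pthread_join(threads[i], NULL);

    printf("counter = %ld\n", counter);
    return 0;
}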

Page 56: Paralel Computing

Threads Model

Threads communicate with each other through global memory (updating address locations). This requires synchronization constructs to ensure that no more than one thread is updating the same global address at any time.

Threads can come and go, but a.out remains present to provide the necessary shared resources until the application has completed.

Threads are commonly associated with shared memory architectures and operating systems.

Page 57: Paralel Computing

Threads Implementations:

From a programming perspective, threads implementations commonly comprise:

* A library of subroutines that are called from within parallel source code
* A set of compiler directives embedded in either serial or parallel source code

In both cases, the programmer is responsible for determining all parallelism.

Threaded implementations are not new in computing. Historically, hardware vendors have implemented their own proprietary versions of threads. These implementations differed substantially from each other making it difficult for

programmers to develop portable threaded applications.

Page 58: Paralel Computing

Threads Implementations:

Unrelated standardization efforts have resulted in two very different implementations of threads: POSIX Threads and OpenMP.

# POSIX Threads
* Library based; requires parallel coding
* Specified by the IEEE POSIX 1003.1c standard (1995).
* C language only
* Commonly referred to as Pthreads.
* Most hardware vendors now offer Pthreads in addition to their proprietary threads implementations.
* Very explicit parallelism; requires significant programmer attention to detail.

# OpenMP
* Compiler directive based; can use serial code
* Jointly defined and endorsed by a group of major computer hardware and software vendors. The OpenMP Fortran API was released October 28, 1997. The C/C++ API was released in late 1998.
* Portable / multi-platform, including Unix and Windows NT platforms
* Available in C/C++ and Fortran implementations
* Can be very easy and simple to use - provides for "incremental parallelism"

# Microsoft has its own implementation for threads, which is not related to the UNIX POSIX standard or OpenMP.
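As a hedged illustration of the "incremental parallelism" style, here is a minimal OpenMP sketch (the loop and array are made up; it assumes an OpenMP-capable compiler, e.g. gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N];

    /* One compiler directive parallelizes the loop; the serial code is otherwise unchanged. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    printf("ran with up to %d threads, a[N-1] = %f\n",
           omp_get_max_threads(), a[N - 1]);
    return 0;
}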

Page 59: Paralel Computing

More Information:

POSIX Threads tutorial: computing.llnl.gov/tutorials/pthreads

OpenMP tutorial: computing.llnl.gov/tutorials/openMP

Page 60: Paralel Computing

Terminology

Performance: A quantifiable measure of the rate of doing (computational) work.

There are multiple such measures of performance:

* Delineated at the level of the basic operation:
  - ops - operations per second
  - ips - instructions per second
  - flops - floating-point operations per second
* Rate at which a benchmark program executes. A benchmark is a carefully crafted and controlled code used to compare systems:
  - Linpack Rmax (Linpack flops)
  - gups (billion updates per second)
  - others

Two perspectives on performance (see the sketch after this list):

* Peak performance - Maximum theoretical performance possible for a system
* Sustained performance - Observed performance for a particular workload and run; varies across workloads and possibly between runs
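As a rough, illustrative sketch (not from the slides), sustained performance for one simple kernel could be measured as below; the kernel, the array sizes and the use of POSIX clock_gettime are arbitrary choices:

#include <stdio.h>
#include <time.h>

#define N 1000000

static double x[N], y[N];

int main(void)
{
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    double sum = 0.0;
    for (int i = 0; i < N; i++)      /* one multiply + one add = 2 flops per iteration */
        sum += x[i] * y[i];

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    /* Sustained rate for this one kernel; peak would come from the hardware specification. */
    printf("sum = %f, sustained rate = %.1f MFLOPS\n", sum, 2.0 * N / secs / 1e6);
    return 0;
}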

Page 61: Paralel Computing

Scalability

The ability to deliver proportionally greater sustained performance through increased system resources.

* Strong Scaling - Fixed-size application problem. Application size remains constant with increase in system size (see the formulas after this list).
* Weak Scaling - Variable-size application problem. Application size scales proportionally with system size.
* Capability computing - In its most pure form: strong scaling. Marketing claims tend toward this class.
* Capacity computing - Throughput computing. Includes job-stream workloads. In its most simple form: weak scaling.
* Cooperative computing - Interacting and coordinating concurrent processes. Not a widely used term. Also: coordinated computing.
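In formula form (a standard way to state these definitions, not taken from the original slide), with T(W, p) denoting the wall-clock time to solve a problem of size W on p processors:

S_{\text{strong}}(p) = \frac{T(W, 1)}{T(W, p)}, \qquad
E_{\text{strong}}(p) = \frac{S_{\text{strong}}(p)}{p}
\quad \text{(fixed problem size } W\text{)}

E_{\text{weak}}(p) = \frac{T(W_1, 1)}{T(p \cdot W_1, p)}
\quad \text{(problem size grown in proportion to } p\text{)}

An efficiency close to 1 indicates good scaling in either sense.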

Page 62: Paralel Computing

The end of the first half of 2nd Lecture

Questions? Comments? Requests?

Page 63: Paralel Computing

Parallel Computing

Instructor: Nurbek Saparkhojayev
Lecture #2: Parallel Programming Models

Page 64: Paralel Computing

Message Passing Model

Page 65: Paralel Computing

Message Passing Model

The message passing model demonstrates the following characteristics:

# A set of tasks that use their own local memory during computation. Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines.

# Tasks exchange data through communications by sending and receiving messages.

# Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.
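A minimal, illustrative MPI sketch of such a matched send/receive pair (assumes an MPI installation; compile with mpicc and run with two tasks, e.g. mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                        /* sender/producer */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                 /* the matching receive on the consumer */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("task 1 received %d from task 0\n", value);
    }

    MPI_Finalize();
    return 0;
}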

Page 66: Paralel Computing

Implementations:

* From a programming perspective, message passing implementations commonly comprise a library of subroutines that are embedded in source code. The programmer is responsible for determining all parallelism.
* Historically, a variety of message passing libraries have been available since the 1980s. These implementations differed substantially from each other, making it difficult for programmers to develop portable applications.
* In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations.
* Part 1 of the Message Passing Interface (MPI) was released in 1994. Part 2 (MPI-2) was released in 1996. Both MPI specifications are available on the web at http://www-unix.mcs.anl.gov/mpi/.
* MPI is now the "de facto" industry standard for message passing, replacing virtually all other message passing implementations used for production work. Most, if not all, of the popular parallel computing platforms offer at least one implementation of MPI. A few offer a full implementation of MPI-2.
* For shared memory architectures, MPI implementations usually don't use a network for task communications. Instead, they use shared memory (memory copies) for performance reasons.

Page 67: Paralel Computing

More Info

MPI tutorial: computing.llnl.gov/tutorials/mpi

Page 68: Paralel Computing

Data Parallel Model

The data parallel model demonstrates the following characteristics:

o Most of the parallel work focuses on performing operations on a data set. The data set is typically organized into a common structure, such as an array or cube.
o A set of tasks work collectively on the same data structure; however, each task works on a different partition of the same data structure.
o Tasks perform the same operation on their partition of work, for example, "add 4 to every array element".

* On shared memory architectures, all tasks may have access to the data structure through global memory. On distributed memory architectures, the data structure is split up and resides as "chunks" in the local memory of each task.

Page 69: Paralel Computing

Data Parallel Model

Page 70: Paralel Computing

Implementations:

Programming with the data parallel model is usually accomplished by writing a program with data parallel constructs. The constructs can be calls to a data parallel subroutine library or compiler directives recognized by a data parallel compiler.

Fortran 90 and 95 (F90, F95): ISO/ANSI standard extensions to Fortran 77.

* Contains everything that is in Fortran 77
* New source code format; additions to character set
* Additions to program structure and commands
* Variable additions - methods and arguments
* Pointers and dynamic memory allocation added
* Array processing (arrays treated as objects) added
* Recursive and new intrinsic functions added
* Many other new features

Implementations are available for most common parallel platforms.

Page 71: Paralel Computing

HPF

# High Performance Fortran (HPF): Extensions to Fortran 90 to support data parallel programming.

* Contains everything in Fortran 90
* Directives to tell the compiler how to distribute data added
* Assertions that can improve optimization of generated code added
* Data parallel constructs added (now part of Fortran 95)

HPF compilers were common in the 1990s, but are no longer commonly implemented.

# Compiler Directives: Allow the programmer to specify the distribution and alignment of data. Fortran implementations are available for most common parallel platforms.

# Distributed memory implementations of this model usually have the compiler convert the program into standard code with calls to a message passing library (usually MPI) to distribute the data to all the processes. All message passing is done invisibly to the programmer.

Page 72: Paralel Computing

Other Models

Other parallel programming models besides those previously mentioned certainly exist, and will

continue to evolve along with the ever changing world of computer hardware and software. Only three of the more common ones are mentioned

here.

Page 73: Paralel Computing

Hybrid

# In this model, any two or more parallel programming models are combined.

# Currently, a common example of a hybrid model is the combination of the message passing model (MPI) with either the threads model (POSIX threads) or the shared memory model (OpenMP). This hybrid model lends itself well to

the increasingly common hardware environment of networked SMP machines.

# Another common example of a hybrid model is combining data parallel with message passing. As mentioned in the data parallel model section

previously, data parallel implementations (F90, HPF) on distributed memory architectures actually use message passing to transmit data between tasks,

transparently to the programmer.
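A minimal, illustrative sketch of the MPI + OpenMP combination (assumes both an MPI library and an OpenMP-capable compiler; the printed output is made up):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);                 /* message passing between SMP nodes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel                    /* threads within each SMP node */
    printf("MPI task %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}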

Page 74: Paralel Computing

Single Program Multiple Data (SPMD)

Page 75: Paralel Computing

SPMD

SPMD is actually a "high level" programming model that can be built upon any combination of the previously mentioned parallel programming models. # A

single program is executed by all tasks simultaneously.

# At any moment in time, tasks can be executing the same or different instructions within the same program.

# SPMD programs usually have the necessary logic programmed into them to allow different tasks to branch or conditionally execute only those parts of the program they are designed to execute. That is, tasks do not necessarily have

to execute the entire program - perhaps only a portion of it.

# All tasks may use different data

Page 76: Paralel Computing

Multiple Program Multiple Data (MPMD)

Page 77: Paralel Computing

MPMD

Like SPMD, MPMD is actually a "high level" programming model that can be built upon any combination of the previously mentioned parallel programming

models. MPMD applications typically have multiple executable object files (programs).

While the application is being run in parallel, each task can be executing the same or different program as other tasks.

# All tasks may use different data

Page 78: Paralel Computing

Parallel Computing

Instructor: Nurbek Saparkhojayev

Lecture #3: Designing Parallel Programs

Page 79: Paralel Computing

Automatic vs. Manual Parallelization

Designing and developing parallel programs has characteristically been a very manual process. The programmer is typically responsible for both identifying and actually implementing parallelism.

Very often, manually developing parallel codes is a time-consuming, complex, error-prone and iterative process.

For a number of years now, various tools have been available to assist the programmer with converting serial programs into parallel programs. The most common type of tool used to automatically parallelize a serial program is a parallelizing compiler or pre-processor.

A parallelizing compiler generally works in two different ways:

1. Fully Automatic

The compiler analyzes the source code and identifies opportunities for parallelism. The analysis includes identifying inhibitors to parallelism and possibly a cost weighting on whether or not the parallelism would actually improve performance.

Loops (do, for) are the most frequent target for automatic parallelization.

Page 80: Paralel Computing

Automatic vs. Manual Parallelization

2. Programmer Directed

Using "compiler directives" or possibly compiler flags, the programmer explicitly tells the compiler how to parallelize the code. This may be used in conjunction with some degree of automatic parallelization as well.

If you are beginning with an existing serial code and have time or budget constraints, then automatic parallelization may be the answer. However, there are several important caveats that apply to automatic parallelization:

* Wrong results may be produced
* Performance may actually degrade
* Much less flexible than manual parallelization
* Limited to a subset (mostly loops) of code
* May actually not parallelize code if the analysis suggests there are inhibitors or the code is too complex

The remainder of this section applies to the manual method of developing parallel codes.

Page 81: Paralel Computing

Understand the Problem and the Program

1. Understand the problem you are trying to solve.

2. Think about the option of parallelizing this problem. Can this problem be parallelized or not?

Example of a Parallelizable Problem:

Calculate the potential energy for each of several thousand independent conformations of a molecule. When done, find the minimum energy conformation.

This problem can be solved in parallel. Each of the molecular conformations is independently determinable. The calculation of the minimum energy conformation is also a parallelizable problem.

# Example of a Non-parallelizable Problem:

Calculation of the Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, ...) by use of the formula:

F(n) = F(n-1) + F(n-2)

This is a non-parallelizable problem because the calculation of the Fibonacci sequence as shown would entail dependent calculations rather than independent ones. The calculation of the F(n) value uses those of both F(n-1) and F(n-2). These three terms cannot be calculated independently and therefore not in parallel.

Page 82: Paralel Computing

Understand the Problem and the Program

3. Identify the program's hotspots:

Know where most of the real work is being done. The majority of scientific and technical programs usually accomplish most of their work in a few places. Profilers and performance analysis tools can help here. Focus on parallelizing the hotspots and ignore those sections of the program that account for little CPU usage.

4. Identify bottlenecks in the program:

Are there areas that are disproportionately slow, or that cause parallelizable work to halt or be deferred? For example, I/O is usually something that slows a program down. It may be possible to restructure the program or use a different algorithm to reduce or eliminate unnecessary slow areas.

5. Identify inhibitors to parallelism. One common class of inhibitor is data dependence, as demonstrated by the Fibonacci sequence above.

6. Investigate other algorithms if possible. This may be the single most important consideration when designing a parallel application.

Page 83: Paralel Computing

Partitioning

One of the first steps in designing a parallel program is to break the problem into discrete "chunks" of work that can be distributed to multiple tasks. This is known as decomposition or partitioning.

There are two basic ways to partition computational work among parallel tasks: domain decomposition and functional decomposition. However, combining these two types of problem decomposition is common and natural.

Page 84: Paralel Computing

a. Domain Decomposition

In this type of partitioning, the data associated with a problem is decomposed. Each parallel task then works on a portion of the data, as sketched below.
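One common way to compute each task's portion under a block (domain) decomposition is sketched here; the helper and its names are illustrative, not from the slides:

#include <stdio.h>

/* Block-distribute n elements over ntasks tasks: task `id` gets indices [lo, hi).
 * Any remainder elements are given, one each, to the lowest-numbered tasks.      */
static void block_range(long n, int ntasks, int id, long *lo, long *hi)
{
    long base = n / ntasks;
    long rem  = n % ntasks;
    *lo = id * base + (id < rem ? id : rem);
    *hi = *lo + base + (id < rem ? 1 : 0);
}

int main(void)
{
    long lo, hi;
    for (int id = 0; id < 3; id++) {            /* e.g. 10 elements over 3 tasks */
        block_range(10, 3, id, &lo, &hi);
        printf("task %d: indices [%ld, %ld)\n", id, lo, hi);
    }
    return 0;
}

In an MPI or threaded code, id and ntasks would come from the task rank or thread number, and each task would then loop only over its own chunk, for (i = lo; i < hi; i++).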

Page 85: Paralel Computing

a. Domain Decomposition

There are different ways to partition data:

Page 86: Paralel Computing

b. Functional Decomposition

In this approach, the focus is on the computation that is to be performed rather than on the data

manipulated by the computation. The problem is decomposed according to the work that must be

done. Each task then performs a portion of the overall work.

Page 87: Paralel Computing

b. Functional Decomposition

Functional decomposition lends itself well to problems that can be split into different tasks. For

example:

1. Ecosystem Modeling

Each program calculates the population of a given group, where each group's growth depends on

that of its neighbors. As time progresses, each process calculates its current state, then exchanges

information with the neighbor populations. All tasks then progress to calculate the state at the next

time step.

Page 88: Paralel Computing

b. Functional Decomposition2. Signal Processing:

An audio signal data set is passed through four distinct computational filters. Each filter is a separate

process. The first segment of data must pass through the first filter before progressing to the second.

When it does, the second segment of data passes through the first filter. By the time the fourth

segment of data is in the first filter, all four tasks are busy.

Page 89: Paralel Computing

b. Functional Decomposition3. Climate Modeling

Each model component can be thought of as a separate task. Arrows represent exchanges of data

between components during computation: the atmosphere model generates wind velocity data that

are used by the ocean model, the ocean model generates sea surface temperature data that are

used by the atmosphere model, and so on.

Page 90: Paralel Computing

Communications

Who Needs Communications?

The need for communications between tasks depends upon your problem:

You DON'T need communications:

- Some types of problems can be decomposed and executed in parallel with virtually no need

for tasks to share data. For example, imagine an image processing operation where every pixel in a

black and white image needs to have its color reversed. The image data can easily be distributed to

multiple tasks that then act independently of each other to do their portion of the work.

- These types of problems are often called embarrassingly parallel because they are so

straight-forward. Very little inter-task communication is required.

You DO need communications

- Most parallel applications are not quite so simple, and do require tasks to share data with each other. For example, a 3-D heat diffusion problem requires a task to know the temperatures calculated by the tasks that have neighboring data. Changes to neighboring data have a direct effect on that task's data.

Page 91: Paralel Computing

Factors to Consider:

There are a number of important factors to consider when designing your program's inter-task

communications:

Cost of communications

- Inter-task communication virtually always implies overhead.

- Machine cycles and resources that could be used for computation are instead

used to package and transmit data.

- Communications frequently require some type of synchronization between

tasks, which can result in tasks spending time "waiting" instead of doing work.

- Competing communication traffic can saturate the available network

bandwidth, further aggravating performance problems.

Latency vs. Bandwidth

Latency is the time it takes to send a minimal (0 byte) message from point A to point B. Commonly expressed as microseconds.

Bandwidth is the amount of data that can be communicated per unit of time. Commonly expressed as megabytes/sec or gigabytes/sec.

Sending many small messages can cause latency to dominate communication overheads. Often it is more efficient to package small messages into a larger message, thus increasing the effective communications bandwidth (see the sketch below).
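A rough way to see this is the simple cost model time = latency + size / bandwidth. The sketch below (illustrative numbers only, not from the slides) compares many small messages with one large message carrying the same data:

#include <stdio.h>

/* Simple communication-cost model: t = latency + size / bandwidth. */
static double msg_time(double latency_s, double bandwidth_Bps, double bytes)
{
    return latency_s + bytes / bandwidth_Bps;
}

int main(void)
{
    /* Illustrative numbers: 2 microseconds latency, 10 GB/s bandwidth. */
    double lat = 2e-6, bw = 10e9;

    /* 1000 messages of 1 KB vs. one 1 MB message carrying the same data. */
    printf("1000 x 1 KB : %.1f us\n", 1e6 * 1000 * msg_time(lat, bw, 1e3));
    printf("1    x 1 MB : %.1f us\n", 1e6 * msg_time(lat, bw, 1e6));
    return 0;
}

With these assumed numbers the batched message is roughly twenty times cheaper, because the fixed latency is paid only once.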

Page 92: Paralel Computing

Factors to consider

Visibility of communications

With the Message Passing Model, communications are explicit and generally quite visible and under the control of the programmer.

With the Data Parallel Model, communications often occur transparently to the programmer, particularly on distributed memory architectures. The programmer may not even be able to know exactly how inter-task communications are being accomplished.

Synchronous vs. asynchronous communications

Synchronous communications require some type of "handshaking" between tasks that are sharing

data. This can be explicitly structured in code by the programmer, or it may happen at a lower level

unknown to the programmer.

Synchronous communications are often referred to as blocking communications since other work

must wait until the communications have completed.

Asynchronous communications allow tasks to transfer data independently from one another. For

example, task 1 can prepare and send a message to task 2, and then immediately begin doing other

work. When task 2 actually receives the data doesn't matter.

Asynchronous communications are often referred to as non-blocking communications since other

work can be done while the communications are taking place.

Interleaving computation with communication is the single greatest benefit for using asynchronous

communications.

Page 93: Paralel Computing

Factors to consider

Scope of communications

Knowing which tasks must communicate with each other is critical during the design stage of a

parallel code. Both of the two scopings described below can be implemented synchronously or

asynchronously.

Point-to-point - involves two tasks with one task acting as the sender/producer of data, and the

other acting as the receiver/consumer.

Collective - involves data sharing between more than two tasks, which are often specified as

being members in a common group, or collective. Some common variations (there are more):

Page 94: Paralel Computing

Factors to consider

Page 95: Paralel Computing

Factors to consider

Efficiency of communications

Very often, the programmer will have a choice with regard to factors that can affect

communications performance. Only a few are mentioned here.

Which implementation for a given model should be used? Using the Message Passing Model as

an example, one MPI implementation may be faster on a given hardware platform than another.

What type of communication operations should be used? As mentioned previously, asynchronous

communication operations can improve overall program performance.

Network media - some platforms may offer more than one network for communications. Which

one is best?

Page 96: Paralel Computing

Overhead and Complexity

Page 97: Paralel Computing

Synchronization

Types of Synchronization:

1. Barrier- Usually implies that all tasks are involved

Each task performs its work until it reaches the barrier. It then stops, or "blocks".

When the last task reaches the barrier, all tasks are synchronized.

What happens from here varies. Often, a serial section of work must be done. In other cases,

the tasks are automatically released to continue their work.

2. Lock / semaphore - Can involve any number of tasks

Typically used to serialize (protect) access to global data or a section of code. Only one task

at a time may use (own) the lock / semaphore / flag.

The first task to acquire the lock "sets" it. This task can then safely (serially) access the

protected data or code.

Other tasks can attempt to acquire the lock but must wait until the task that owns the lock

releases it. Can be blocking or non-blocking

3. Synchronous communication operations

Involves only those tasks executing a communication operation

When a task performs a communication operation, some form of coordination is required with

the other task(s) participating in the communication. For example, before a task can perform a send

operation, it must first receive an acknowledgment from the receiving task that it is OK to send.

Discussed previously in the Communications section
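As a minimal, illustrative barrier sketch using MPI (assumes an MPI installation; the printed messages are made up):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("task %d: doing its own work\n", rank);

    MPI_Barrier(MPI_COMM_WORLD);   /* every task blocks here until the last one arrives */

    if (rank == 0)                 /* a serial section often follows the barrier */
        printf("all tasks synchronized\n");

    MPI_Finalize();
    return 0;
}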

Page 98: Paralel Computing

The end of the lecture

Questions? Comments?

Page 99: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS

PARALLEL COMPUTER ARCHITECTURE

Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
January 20, 2011

Page 100: Paralel Computing


Topics

• Introduction
• Review Performance Factors (good & bad)
• What is Computer Architecture
• Parallel Structures & Performance Issues
• Performance Metrics
• Coarse-grained MIMD Processing – MPPs
• Very Fine-grained Vector Processing and PVPs
• SIMD array and SPMD
• Special Purpose Devices and Systolic Structures
• An Introduction to Shared Memory Multiprocessors
• Current generation multicore and heterogeneous architectures
• Summary – Material for the Test


Page 102: Paralel Computing


Opening Remarks

• This lecture is an introduction to supercomputer architecture
  – Major parameters, classes, and system level
• Architecture exploits device technology to deliver its innate computation performance potential
  – Structures and system organization
  – Semantics of operation and memory (instruction set architecture, ISA)
• Between device technology and architecture is circuit design
  – Circuit design converts devices to logic gates and higher level logical structures (e.g. multiplexers, adders)
  – but this is outside the scope of this course.
• We will assume a basic logic abstraction with characterizing properties:
  – Functional behavior (the logical operation it performs)
  – Switching speed
  – Propagation delay or latency
  – Size and power

Page 103: Paralel Computing


HPC System Stack


Science Problems: Environmental Modeling, Physics, Computational Chemistry, etc.
Application: Coastal Modeling, Black hole simulations, etc.
Algorithms: PDE, Gaussian Elimination, 12 Dwarves, etc.
Program Source Code
Programming Languages: Fortran, C, C++, UPC, Fortress, X10, etc.
Compilers: Intel C/C++/Fortran Compilers, PGI C/C++/Fortran, IBM XLC, XLC++, XLF, etc.
Runtime Systems: Java Runtime, MPI, etc.
Operating Systems: Linux, Unix, AIX, etc.
Systems Architecture: Vector, SIMD array, MPP, Commodity Cluster
Firmware: Motherboard chipset, BIOS, NIC drivers
Microarchitectures: Intel/AMD x86, SUN SPARC, IBM Power 5/6
Logic Design: RTL
Circuit Design: ASIC, FPGA, Custom VLSI
Device Technology: NMOS, CMOS, TTL, Optical

(A rotated label in the original diagram reads "Model of Computation".)


Page 105: Paralel Computing


Performance Factors: Technology Speed

• Latencies
  – Logic latency time
  – Processor to memory access latency
  – Memory access time
  – Network latency
• Cycle Times
  – Logic switching speed
  – On-chip clock speed (clock cycle time)
  – Memory cycle time
• Throughput
  – On-chip data transfer rate
  – Instructions per cycle
  – Network data rate
• Granularity
  – Logic density
  – Memory density
  – Task size
  – Packet size

Page 106: Paralel Computing


Machine Parameters affecting Performance

• Peak floating point performance
• Main memory capacity
• Bi-section bandwidth
• I/O bandwidth
• Secondary storage capacity
• Organization
  – Class of system
  – # nodes
  – # processors per node
  – Accelerators
  – Network topology
• Control strategy
  – MIMD
  – Vector, PVP
  – SIMD
  – SPMD

Page 107: Paralel Computing


Performance Factors: Parallelism

• Fully independent processing elements operating concurrently on separate tasks
  – Coarse grained
  – Communicating Sequential Processes (CSP)
  – Single Program Multiple Data stream (SPMD)
• Instruction Level Parallelism (ILP)
  – Fine grained
  – Single instruction performs multiple operations
• Pipelining
  – Fine grained
  – Overlapping sequential operations in execution pipeline
  – Vector pipelines
• SIMD operations
  – Fine / Medium grained
  – Single Instruction stream, Multiple Data stream
  – ALU arrays
• Overlapping of computation and communication
  – Fine / Medium grained
  – Asynchronous
  – Prefetching
• Multithreading
  – Medium grained
  – Separate instruction streams serve a single processor

Page 108: Paralel Computing


Sources of Performance Degradation (SLOW)

• Starvation
  – Not enough work to do among distributed resources
  – Insufficient parallelism
  – Inadequate load balancing
  – e.g.: Amdahl's law (see the formula after this list)
• Latency
  – Time required for response of access to remote data or services
  – Waiting for access to memory or other parts of the system
  – e.g.: Local memory access, network communication
• Overhead
  – Extra work that has to be done to manage program concurrency and parallel resources beyond the real work you want to perform
  – Critical-path work for management of concurrent tasks and parallel resources not required for sequential execution
  – e.g.: Synchronization and scheduling
• Waiting for Contention
  – Delays due to conflicts for use of shared resources
  – e.g.: Memory bank conflicts, shared network channels
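For reference, the standard statement of Amdahl's law (added here; it appears on the slide only by name): if a fraction f of the work is inherently sequential and the rest parallelizes perfectly over p processors, then

S(p) = \frac{1}{\,f + \dfrac{1 - f}{p}\,} \;\le\; \frac{1}{f}

For example, with f = 0.1 the speedup can never exceed 10, no matter how many processors are added.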


Page 110: Paralel Computing


Computer Architecture

• Structure
  – Functional elements
  – Organization and balance
  – Interconnect and data flow paths
• Semantics
  – Meaning of the logical constructs
  – Primitive data types
  – Manifest as the Instruction Set Architecture abstract layer
• Mechanisms
  – Primitive functions that are usually implemented in hardware or sometimes firmware
  – Determines preferred actions and sequences
  – Enables efficiency and scalability
• Policy
  – Approach and priorities to accomplishing a goal
  – e.g., cache replacement policy

Page 111: Paralel Computing


Structure

• Functional elements
  – The form of functional elements made up of more primitive logical modules
  – e.g. a vector arithmetic unit comprising a pipeline of simple stages
• Organization and balance
  – Number of major elements of different types
  – Hierarchy of collections of elements
• Data flow
  – Interconnection of functional, state, and communication elements
  – Control of dataflow paths determines actions of processor and system

Page 112: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

14

Semantics

• Meaning of the logical constructs– Basic operations that can be performed on data

• Primitive data types– What collections of bits (e.g. word) means– Defines actions that can be performed on binary strings

• Instruction Set Architecture– Defined set of actions that can be performed and data object on

which they can be applied– Encoding of binary strings to represent distinct instructions

• Parallel control constructs– Hardware implemented : vector operations, – Software implemented : MPI libraries

Page 113: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

15

Mechanisms• Primitive functions that are usually implemented in

hardware or sometimes firmware– Lower level than instruction set operations– Multiple such mechanisms contribute to execution of

operation

• Determines preferred actions and sequences– Usually time effective primitives– Usually widely used by many instructions

• Enables efficiency and scalability– Establishes basic performance properties of machine

• Examples– Basic arithmetic and logic unit functions– Thread context switching– TLB (Translation Lookaside Buffer) address translation– Cache line replacement– Branch prediction

Page 114: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

16

Policy• Hardware architecture policies

– Decision of ordering or allocation dependent on criteria– Not all machine decisions are visible to the ISA of the system– Not all machine choices are available to the name space of the

operands– Examples

• Cache structure, size, and speed • Cache replacement policies• Order of operation execution• Branch prediction• Allocation of shared resources• Network routers

• Software system management policies– Scheduling,– Data allocation : partitioning of a problem

Page 115: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

17

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 116: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

18

Parallel Structures & Performance Issues

• Pipelining– Vector processing– Execution pipeline– Performance Issues:

• Pipelining increases throughput : More operations per unit time• Pipelining increases latency time : Operation on single operand pair

can take longer than non-pipelined functional unit

• Multiple Arithmetic Units– Instruction level parallelism– Systolic arrays– Performance Issues:

• Increases peak performance• Requires application instruction level parallelism• Average usually significantly lower than peak

Page 117: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

19

Parallel Structures & Performance Issues

• Multiple processors– MIMD: Separate control– SIMD: Single controller– Multicore– Accelerators

• Performance Issues: Multiple processors require overhead operations– Synchronization– Communications – Possibly cache Coherence

Page 118: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

20

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 119: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

21

Scalability• The ability to deliver proportionally greater sustained performance through

increased system resources• Strong Scaling

– Fixed size application problem– Application size remains constant with increase in system size

• Weak Scaling– Variable size application problem– Application size scales proportionally with system size (both regimes are sketched as formulas after this list)

• Capability computing– in most pure form: strong scaling– Marketing claims tend toward this class

• Capacity computing– Throughput computing– Includes job-stream workloads

– In most simple form: weak scaling

• Cooperative computing– Interacting and coordinating concurrent processes– Not a widely used term– Also: “coordinated computing”
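Both regimes are often summarized with two standard formulas (a supplementary sketch using the conventional definitions, not equations from the slide): with parallel fraction $f$ and $n$ processors,

$S_{strong}(n) = \dfrac{1}{(1-f) + f/n}$ (Amdahl), $\qquad$ $S_{weak}(n) = (1-f) + f\,n$ (Gustafson)

so strong scaling saturates at $1/(1-f)$, while weak scaling can continue to grow as long as the per-processor problem size is held constant.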

Page 120: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

22

Performance Metrics

• Peak floating point operations per second (flops)• Peak instructions per second (ips)• Sustained throughput

– Average performance over a period of time– flops, Mflops, Gflops, Tflops, Pflops – flops, Megaflops, Gigaflops, Teraflops, Petaflops– ips, Mips, ops, Mops …

• Cycles per instruction– cpi – Alternatively: instructions per cycle, ipc

• Memory access latency– measured in processor cycles (or nanoseconds); see the relation sketched after this list

• Memory access bandwidth– bytes per second (Bps)– bits per second (bps)– or Gigabytes per second, GBps, GB/s

• Bi-section bandwidth– bytes per second
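These metrics are tied together by a standard relation (added here for reference; it is not spelled out on the slide): for a program executing $N_{instr}$ instructions on a processor with clock frequency $f_{clk}$,

$T_{exec} = \dfrac{N_{instr} \times CPI}{f_{clk}}$, $\qquad$ sustained flops $= \dfrac{\text{floating point operations completed}}{T_{exec}}$

so sustained throughput reflects the achieved CPI (or IPC) and the memory behaviour of the code, whereas peak flops reflects only the hardware's best case.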

Page 121: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

23

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 122: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Basic Uni-processor Architecture elements

• I/O Interface

• Memory Interface

• Cache hierarchy

• Register Sets

• Control

• Execution pipeline

• Arithmetic Logic Units

24

Page 123: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

25

Multiprocessor• A general class of system• Integrates multiple processors in to an interconnected ensemble• MIMD: Multiple Instruction Stream Multiple Data Stream• Different memory models

– Distributed memory• Nodes support separate address spaces

– Shared memory• Symmetric multiprocessor• UMA – uniform memory access• Cache coherent

– Distributed shared memory• NUMA – non uniform memory access• Cache coherent

– PGAS• Partitioned global address space• NUMA• Not cache coherent

– Hybrid : Ensemble of distributed shared memory nodes• Massively Parallel Processor, MPP

Page 124: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

26

Massively Parallel Processor

• MPP• General class of large scale multiprocessor• Represents largest systems

– IBM BG/L– Cray XT3

• Distinguished by memory strategy– Distributed memory– Distributed shared memory

• Cache coherent• Partitioned global address space

• Custom interconnect network• Potentially heterogeneous

– May incorporate accelerator to boost peak performance

Page 125: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

DM - MPP

27

Page 126: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

28

IBM Blue Gene/L

Page 127: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Historical Top-500 List

29

Page 128: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

30

BG/L packaging hierarchy

Page 129: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

ASCI RED

Compute Nodes 4,536

Service Nodes 32

Disk I/O Nodes 32

System Nodes (Boot) 2

Network Nodes (Ethernet, ATM) 10

System Footprint 1,600 Square Feet

Number of Cabinets 85

System RAM 594 Gbytes

Topology 38x32x2

Node to Node bandwidth - Bi-directional 800 Mbytes/sec

Bi-directional - Cross section Bandwidth 51.6 Gbytes/sec

Total number of Pentium® Pro Processors 9,216

Processor to Memory Bandwidth 533 Mbytes/sec

Compute Node Peak Performance 400 MFLOPS

System Peak Performance 1.8 TFLOPS

RAID I/O Bandwidth (per subsystem) 1.0 Gbytes/sec

RAID Storage (per subsystem) 1 Tbyte

31

Page 130: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

ASCI RED : I/O Board

32

Page 131: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

33

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 132: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

34

Pipeline Structures• Partitioning of functional unit into a sequence of stages

– Execution time of each stage is < that of the original unit– Total time through sequence of stages is usually > that of the

original unit• Pipeline permits overlapping of multiple operations

– At any one time: each stage is performing different operation– # of operations being performed in parallel = # stages

• Performance– Pipeline increments at clock rate of slowest pipeline stage– Response time for an operation is product of # stages and clock

cycle time– Throughput = clock rate

• i.e. one operation result per clock cycle of pipeline

• Pipeline structures employed in many parts of a computer architecture – to enable high throughput in the presence of high latency – enable faster clock rates

Page 133: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

35

Pipeline : Concepts

$Perf_c = \dfrac{1}{T_c}$

$Perf_p = \dfrac{1}{t_p}$

$T_p = N \times t_p$

$t_p < T_c < T_p$, hence $Perf_p > Perf_c$

Where :

• $T_c$ is the Logic Latency (of the non-pipelined unit)

• $T_p$ is the aggregated pipeline latency

• $t_p$ is the latency for each pipelined step

• $N$ is the number of pipeline stages
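As a quick worked check of these relations (the numbers are chosen for illustration only): if a functional unit with logic latency $T_c = 10$ ns is split into $N = 5$ stages of $t_p = 2.5$ ns each (stages rarely divide the logic perfectly, so $t_p > T_c/N$), then the pipeline latency is $T_p = N \times t_p = 12.5$ ns $> T_c$, but once the pipeline is full the throughput rises from $Perf_c = 1/T_c = 100$ Mops/s to $Perf_p = 1/t_p = 400$ Mops/s.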

Page 134: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

36

Vector Processors• Supports fine grained data parallel semantics

– Many instances of same operation performed concurrently under same control element

– Operates on vector data structures rather than single scalar values– Vector-scalar operations

• Scale a vector by a scalar factor (multiply each vector element by scalar)

– Inter-vector operations• e.g., Pair wise multiplies

– Intra-vector operation• Reduction operators• e.g., sum all elements of a vector

• Exploits pipeline structure– Arithmetic units– Vector registers– Overlap of memory banks access cycles– Overlap of communication with computation

• Limited scaling – upper bound on number of pipeline stages
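The kind of loop a vector pipeline targets can be written, for example, as a simple vector triad in C (an illustrative sketch, not code from the lecture); every iteration is independent, so the elements can stream through the vector pipeline with one result per clock once the pipeline is full:

#include <stddef.h>

/* y[i] = a * x[i] + y[i] : one multiply-add per element with no
   loop-carried dependence, so the loop maps directly onto a vector
   (or SIMD) pipeline. */
void vector_triad(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}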

Page 135: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

37

Vector Pipeline Architecture

[Diagram] A vector register of $N_R$ elements feeds a vector ALU of $N_S$ stages; memory banks $M_1, M_2, \ldots, M_N$ sit on a high speed memory bus ($t_a$ : time for memory access).

$T_{MV} = t_s + \sum t_a$, $\qquad P_M = \dfrac{N}{T_{MV}}$

Where :

• $t_a$ is the time for memory access

• $t_s$ is the startup time

• $T_{MV}$ is the combined time for the memory-to-vector-register transfer

• $P_M$ is the memory performance

• $t_c$ is the ALU clock time of each step

Ideal vector performance : $\dfrac{1}{t_c}$

Achieved vector performance : $Perf_R = \dfrac{N_R}{(N_S + N_R) \times t_c}$

With $N_S := N_R$ : $Perf_R = \dfrac{N_R}{2 N_R \times t_c} = \dfrac{1}{2\,t_c}$

Page 136: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

38

Cray 1

[Figures: the Cray 1 system; Cray 1 logic boards]

• First announced in 1975-6 • 80 MHz Clock rate• Theoretical peak performance (160

MIPS), average performance 136 megaflops, vector optimized peak performance 150 megaflops

• 1-million 64 bit words of high speed memory

• Manufactured by Cray Research Inc.• First Customer was National Center for

Atmospheric Research (NCAR) for 8.86 million dollars.

src : http://en.wikipedia.org/wiki/Cray-1

Page 137: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

39


Page 138: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

40

Parallel-Vector-Processors: PVP

• Combines strengths of vector and MPP– Efficiency of vector processing

• Capability computing

– Scalability of massively parallel processing• Capacity and cooperative computing

• Two levels of parallelism– Ultra fine grain vector parallelism with vector pipelining– Medium to coarse-grain processor

• Memory model– Alternative ways of organizing memory & address space– Distributed memory

• Shared memory within node of multiple vector processors• Fragmented or decoupled address space between nodes

– Partitioned global address space• Globally accessible address space• No cache coherence between nodes

Page 139: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

PVP (e.g. Cray X-MP)

41

Page 140: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

42

Earth Simulator

src : http://www.es.jamstec.go.jp/esc/eng/

Page 141: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

43

Earth Simulator (Facts)

• Located in Yokohama, Japan• Size of the entire center about 4 tennis courts• Can execute 35.86 trillion (35,860,000,000,000) FLOPS,

or 35.86 TFLOPS (LINPACK)• Consists of 640 nodes with each node consisting of 8

vector processors and 16 GB of memory• Totaling 5120 processors and 10 Terabytes of memory• Aggregated disk storage of 700 Terabytes and around

1.6 Petabytes of storage in tape drives • Costs about 350 million dollars• First on the Top500 list for 5 consecutive times.

Surpassed by IBM's BlueGene/L prototype on September 24, 2004

Page 142: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011 44

PVP Examples

• Early machines– CRI XMP, YMP, C-90, T-90– Cray 2– Fujitsu VP5000

• SX-8

• Cray X1

Steve Scott

Cray Inc.

Page 143: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

45

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 144: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

46

SIMD Array• SIMD semantics

– Single Instruction stream, Multiple Data stream– Data set partitioned into blocks upon which the same operations are applied

• One or two dimensions (vectors or matrices)

– Each data block is processed separately– Each data block is controlled by same instruction sequence

– Data exchange cycle

• SIMD Parallel Structure– Node Array of arithmetic units, each coupled to local memory– Interconnect network for global data exchange– Single controller to issue instructions to array nodes

• Early systems broadcast one instruction at a time• Modern systems point to sequence of cached instructions

• SPMD– Single Program Multiple Data Stream– Microprocessor based system where each node runs same program
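A minimal SPMD sketch in C with MPI (an illustrative example using only standard MPI calls, not code from the lecture): every node runs the same program, and each process uses its rank to select its own share of the data.

#include <mpi.h>
#include <stdio.h>

/* Single Program Multiple Data: identical executable everywhere,
   the rank decides which block of work this process owns. */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1000000;          /* total elements (illustrative) */
    int chunk = n / size;           /* this process's share */
    int first = rank * chunk;

    double local_sum = 0.0;
    for (int i = first; i < first + chunk; i++)
        local_sum += (double)i;     /* stand-in for real work */

    printf("rank %d of %d handled elements %d..%d (local sum %g)\n",
           rank, size, first, first + chunk - 1, local_sum);

    MPI_Finalize();
    return 0;
}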

Page 145: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

SIMD

47

Page 146: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

48

[Diagram] Simplified SIMD Diagram: a control processor (sequencer) issues instructions over an instruction broadcast bus to an array of processing elements (data processors $MD_{00} \ldots MD_{nn}$), which exchange data through a switch.

Page 147: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

49

CM-2

CM-2 General Specifications :• Processors 65,536• Memory 512 Mbytes• Memory Bandwidth 300 Gbits/Sec• I/O Channels 8• Capacity per Channel 40 Mbytes/Sec• Max. Transfer Rate 320 Mbytes/Sec• Performance in excess of 2500 MIPS• Floating Point performance in excess of 2.5 GFlops

DataVault Specifications :• Storage Capacity 5 or 10 Gbytes• I/O Interfaces 2• Transfer Rate, Burst 40 Mbytes/Sec• Max. Aggregate Rate 320 Mbytes/Sec

• Originated at MIT, by Danny Hillis• Commercialized at Thinking Machines Corp. src : http://www.svisions.com/sv/cm-dv.html

Page 148: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

50

ClearSpeed SIMD Accelerator

• 1997 Intel ASCI Red Supercomputer• 1TFLOPS, 2,500 sq.

ft., 800KW, $55Million

• 2007 ClearSpeed + Intel Dense Cluster• 1 TFLOPS, 25 sq. ft.,

<7 KW, <$200K

• Medium-Coarse grained SIMD• 130nm fabrication technology• 250 MHz clock rate• 100 Gflops peak, 66 Gflops sustained

Page 149: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Tsubame

• Heterogeneous computing : Added ClearSpeed Boards

• 648 nodes resulting in 38.5 TFLOPS

• 648 nodes with 360 ClearSpeed boards to 47.38 TFLOPS

51

Page 150: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

52

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 151: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011 53

Special Purpose Devices • SPD• Optimized for a given algorithm or class of

problems• Functional elements and dataflow path mirror

the requirements of a specific algorithm• Usually exploits fine grain parallelism for very

high parallelism• Best for arithmetic (or logic) intensive

applications with limited memory access requirements

• Best for strong temporal and spatial locality• Systolic Arrays are one class of such

machines widely used in digital signal processing

• Examples– MD-Grape first Petaflops machine, for N-body

problem– GPU Graphics Processing Unit, e.g. NVIDIA– FPGA field programmable gate array

• Allows reconfiguration of logic array

Page 152: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011 54

Systolic Arrays

[Diagram] Example implementation: the Warp architecture. A host connects through an interface unit (address, X and Y data paths) to a linear Warp processor array of cells (Cell 1, Cell 2, Cell 3, …, Cell n); A and B operands stream through the processing elements while the C results accumulate.

Matrix multiplication on a Systolic Array: each processing element accumulates

$c_{ij} = \displaystyle\sum_{k=1}^{n} a_{ik}\, b_{kj}$

References:
M. Annaratone, E. Arnould, et al., “The Warp Computer: Architecture, Implementation, and Performance”
Y. Yang, W. Zhao, and Y. Inoue, “High-Performance Systolic Arrays for Band Matrix Multiplication”
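For reference, the term each cell accumulates is just the ordinary matrix-product sum above; a plain C rendering (an illustrative sketch, not the Warp implementation) is:

/* C = A * B for n x n row-major matrices: c[i][j] = sum_k a[i][k] * b[k][j].
   A systolic array streams the a and b operands through its cells so that
   each cell performs one multiply-accumulate per beat. */
void matmul(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}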

Page 153: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

55

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 154: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

56

Introduction to SMP

• Symmetric Multiprocessor

• Building block for large MPP• Multiple processors

– 2 to 32 processors– Now Multicore

• Uniform Memory Access (UMA) shared memory– Every processor has equal access in equal time to all banks of

the main memory

• Cache coherent– Multiple copies of variable maintained consistent by hardware
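On such a cache-coherent UMA node the usual programming style is multithreading over shared data; a minimal OpenMP sketch in C (illustrative, not taken from the lecture) in which the hardware keeps the shared array consistent across the cores' caches:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    enum { N = 1000000 };
    static double x[N];
    double sum = 0.0;

    /* Threads share x[]; each works on a slice, and the reduction
       clause combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        x[i] = 0.5 * i;
        sum += x[i];
    }

    printf("max threads = %d, sum = %g\n", omp_get_max_threads(), sum);
    return 0;
}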

Page 155: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

SMP - UMA

57

Page 156: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

58

SMP Node Diagram

[Diagram] An SMP node: pairs of microprocessors (MP), each with private L1 and L2 caches and a shared L3, connect through a controller to memory banks M1 … Mn-1, storage (S), and network interface cards (NIC) on Ethernet and PCI-e, plus USB peripherals and a JTAG port.

Legend : MP : MicroProcessor; L1, L2, L3 : Caches; M1.. : Memory Banks; S : Storage; NIC : Network Interface Card

Page 157: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

DSM - NUMA

59

Page 158: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Challenges to Computer Architecture• Expose and exploit extreme fine-grain parallelism

– Possibly multi-billion-way (for Exascale)– Data structure-driven (use meta-data parallelism)

• State storage takes up much more space than logic– 1:1 flops/byte ratio infeasible– Memory access bandwidth is the critical resource

• Latency – can approach a million cycles (10,000 or more cycles, typical)– All actions are local– Contention due to inadequate bandwidth

• Overhead for fine grain parallelism must be very small – or system can not scale– One consequence is that global barrier synchronization is untenable

• Power consumption• Reliability

– Very high replication of elements– Uncertain fault distribution– Fault tolerance essential for good yield

• Design complexity– Impacts development time, testing, power, and reliability

60

Page 159: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

61

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 160: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Multi-Core

• Motivation for Multi-Core– Exploits increased feature-size and density– Increases functional units per chip (spatial efficiency)– Limits energy consumption per operation– Constrains growth in processor complexity

• Challenges resulting from multi-core– Relies on effective exploitation of multiple-thread parallelism

• Need for parallel computing model and parallel programming model– Aggravates memory wall

• Memory bandwidth– Way to get data out of memory banks– Way to get data into multi-core processor array

• Memory latency• Fragments L3 cache

– Pins become strangle point• Rate of pin growth projected to slow and flatten• Rate of bandwidth per pin (pair) projected to grow slowly

– Requires mechanisms for efficient inter-processor coordination• Synchronization• Mutual exclusion• Context switching

62

Page 161: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

IBM Blue Gene/L

63

Page 162: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Intel Core i7

64

Page 163: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

AMD Quad Core Architecture

65

AMD quad-core x86 Opteron processor layout

Page 164: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

66

Page 165: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

IBM/SONY Cell Architecture

• Product of the “STI” alliance: SCEI (Sony), Toshiba and IBM

• Budget estimate ~$400 mil• Primary design center in Austin, TX

(March 2001)• Modified POWER4 toolchain

• The effort took 4 years, with over 400 engineers and 11 IBM centers involved

• Original target applications:

– Sony Playstation 3– IBM blade server– Toshiba HDTV

67

Page 166: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Cell Processor in Numbers

• 234 mil transistors• 221mm2 die on 90nm process• SOI, low-k dielectrics, copper interconnects• 3.2GHz clock speed (over 5Ghz in lab)• Peak performance:

– over 256Gflops @4GHz, single precision– ~26Gflops, double precision– memory bandwidth: 25.6Gbytes/s– I/O bandwidth: 76.8Gbytes/s (48.8 outbound, 32

inbound)• Power consumption undisclosed, estimated at 30W

(MacWorld) or 50-80W (other sources); 5 power states

68

Page 167: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Internal Structure

69

Page 168: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Cell Components and Layout

• One Power Processing Element (PPE)

• Multiple Synergistic Processing Elements (SPE)

• Element Interconnect Bus (EIB)

• Dual channel XDR memory controller

• FlexIO external I/O interface

70

Page 169: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Conventional Strategies to Address the Multi-Core Challenge

• Maintain status quo– Investment in current code stack– Investment in core design

• Increase L2/L3 cache size– Attempt to exploit existing temporal locality

• Increase chip I/O bandwidth– Reduce contention– Eventually embedded optical interfaces chip-to-chip

• Memory bandwidth aggregation through “weaver” chip– Balances processor data demand with memory supply rate– Enables and coordinates multiple overlapping memory banks

• Exploit job stream parallelism– Independent jobs

• O/S scheduling

– Concurrent parametric processes• Multiple instances of same job across parametric set• e.g., Condor

– Coarse grain communicating sequential processes• Message passing; e.g., MPI• Barrier synchronization

71

Page 170: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Limitations of Conventional Incremental Approaches to MultiCore

• It's not just SMP on a chip– Cores on wrong side of the pins– Users expect to see performance gain on existing applications

• Highly sensitive to temporal locality– Fragile in the presence of memory latency– Uses up majority of chip area on caching

• Emphasizes ALU as precious resource– ALU low spatial cost – Memory bandwidth is pacing element for data intensive problems

• Low effective energy usage– Suffers from core complexity

• Does not address intrinsic problems of low efficiency– Just hoping to stay even with Moore’s Law– Single digit sustained/peak performance– Bad when ALU is critical path element

• The Memory Wall is getting Worse!

72

[Chart] CPU clock period (ns) and memory system access time (ns) from 1997 to 2009, with the memory-to-CPU ratio on a second axis: the ratio grows steadily toward several hundred, i.e. the Memory Wall keeps getting worse.

Page 171: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Commodity Clusters

• Distributed Memory systems

• Superior performance to cost

• Dominant parallel systems architecture on the Top 500 List

• Combines off the shelf systems in scalable structure

• Employs commercial high-bandwidth networks for integration

• Message Passing programming model used (e.g. MPI)

• First cluster on Top500 : Berkeley NOW, 1997

73

Page 172: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

74

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 173: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

75

Summary – Material for the Test

• HPC System Stack – slide 5• Performance factors : Technology speed – slide 7• Performance factors : Parallelism –slide 9• Sources of Performance Degradation – slide 10• Computer architecture – slides 12-16• Parallel Structures – slide 18• Performance issues of parallel structures – slide 19• Scalability – slide 21• Performance Metrics – slide 22• Basic uni-processor architecture elements – slide 24• Multiprocessor architecture slides – slides 25 • MPP systems – slides 26,27• Pipeline structures – slides 34,35• Vector processors – slides 36,37• Parallel vector processors (PVP) – slides 40, 41• SIMD – slides 46, 47• Challenges to computer architecture – slides 60

Page 174: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

76

Page 175: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &

MEANS

COMMODITY CLUSTERS

Prof. Thomas Sterling

Department of Computer Science

Louisiana State University

January 25, 2011

Page 176: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

2

Page 177: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

3

Page 178: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

4

What is a Commodity Cluster

• It is a distributed/parallel computing system

• It is constructed entirely from commodity subsystems

– All subcomponents can be acquired commercially and separately

– Computing elements (nodes) are employed as fully operational

standalone mainstream systems

• Two major subsystems:

– Compute nodes

– System area network (SAN)

• Employs industry standard interfaces for integration

• Uses industry standard software for majority of services

• Incorporates additional middleware for interoperability among

elements

• Uses software for coordinated programming of elements in parallel

Page 179: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

5

Page 180: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

6

Earth Simulator and

TSUBAME

Page 181: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

7

Red Sky

• One of the largest clusters in the

world (located in Sandia National

Laboratories, USA)

• Sun Blade x6275 system family

• 41616 Cores

• Intel EM64T Xeon X55xx (Nehalem-

EP) 2930 MHz (11.72 GFlops)

• 22104 GB main memory

• Number 10 on TOP500

• Infiniband interconnection

• Peak performance:

487 Tflops

• R_max:

423 Tflops

Page 182: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

8

Commodity Clusters vs “Constellations”

[Diagram] A 64-processor Constellation (four 16-way nodes, "16X") and a 64-processor Commodity Cluster (sixteen 4-way nodes, "4X"), each tied together by a System Area Network.

• An ensemble of N nodes each comprising p computing elements

• The p elements are tightly bound shared memory (e.g., smp, dsm)

• The N nodes are loosely coupled, i.e., distributed memory

• p is greater than N

• Distinction is which layer gives us the most power through parallelism
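Stated compactly (a restatement of the bullets above, not additional slide content): with $N$ nodes of $p$ processors each, the machine has $P = N \times p$ processors in total; when $p > N$ most of the parallelism lives inside the shared-memory nodes and the ensemble is called a constellation, and when $N > p$ most of the parallelism is across the network and it is a commodity cluster.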

Page 183: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

9

Columbia

• NASA’s largest computer

• NASA Ames Research Center

• A Constellation

– 20 nodes

– SGI Altix 512 processor nodes

– Total: 10,240 Intel Itanium-2

processors

• 400 Terabytes of RAID

• 2.5 Petabytes of silo farm tape

storage

Page 184: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

10

Page 185: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

11

A Brief History of Clusters

• 1957 – SAGE by IBM & MIT-LL for Air Force NORAD

• 1976 -- Ethernet

• 1984 – Cluster of 160 Apollo workstations by NSA

• 1985 – M31 Andromeda by DEC, 32 VAX 11/750

• 1986 – Production Condor cluster operational

• 1990 – PVM released

• 1993 – First NOW workstation cluster at UC Berkeley

• 1993 – Myrinet introduced

• 1994 – First Beowulf PC cluster at NASA Goddard

• 1994 – MPI standard

• 1996 – >1Gflops

• 1997 – Gordon Bell Prize for Price-Performance

• 1997 – Berkeley NOW first cluster on Top-500

• 1997 -- >10 Gflops

• 1998 – Avalon by LANL on Top500 list

• 1999 -- >100 Gflops

• 2000 – Compaq and PSC awarded 5 Tflops by NSF

Page 186: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

12

UC-Berkeley NOW Project

• NOW-1 1995

• 32-40 SparcStation 10s and

20s

• originally ATM

• first large myrinet network

NOW-2 1997

100+ Ultra Sparc 170s

128 MB, 2 2GB disks, ethernet, myrinet

largest Myrinet configuration in the world

First cluster on the TOP500 list

Page 187: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

13

NOW Accomplishments

• Early prototypes in 1993 & 1994

• First Inktomi

• Complete Glunix + virtual network environment– able to page many processes onto dedicated

user-level network resources

• NPACI production resource since 1998

• Active Messages demonstrates user level communication in full Unix environment

• First cluster on the TOP500 list

• Set all Parallel Disk-disk sort records (2 yrs)– 500 MB/s disk bandwidth

– 1,000 MB/s network bandwidth

• Basis for studies in novel OS structures

[Chart] Minute Sort: gigabytes sorted versus number of processors (0–100), comparing NOW against the SGI Power Challenge and SGI Origin.

Page 188: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

14

NASA Beowulf Project

Wiglaf - 1994

16 Intel 80486 100 MHz

VESA Local bus

256 Mbytes memory

6.4 Gbytes of disk

Dual 10 base-T Ethernet

72 Mflops sustained

$40K

Hrothgar - 1995

16 Intel Pentium100 MHz

PCI

1 Gbyte memory

6.4 Gbytes of disk

100 base-T Fast Ethernet

(hub)

240 Mflops sustained

$46K

Hyglac-1996 (Caltech)

16 Pentium Pro 200 MHz

PCI

2 Gbytes memory

49.6 Gbytes of disk

100 base-T Fast Ethernet

(switch)

1.25 Gflops sustained

$50K

Page 189: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

15

Beowulf Accomplishments

• An experiment in parallel computing systems

• Established the vision of low-cost HPC

• Demonstrated effectiveness of PC clusters for some classes of applications

• Provided networking software in Linux

• Mass Storage with PVFS

• Provided cluster management tools

• Achieved >10 Gflops performance

• Gordon Bell Prize for Price-Performance

• Conveyed findings to broad community

• Tutorials and the book

• Provided design standard to rally community

• Spin-off of Scyld Computing Corp.

Hive at GSFC

Naegling at Caltech CACR

Page 190: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

16

“Do it Yourself Supercomputers”

• Synthesis of just-ready hardware/software elements

• Narrow window of opportunity

• PCs just capable of a few Mflops

• Ethernet LAN (10 base-T) just cheap enough

• A cost constrained requirement with funding

• An open source Unix, albeit immature

• Experience with clustering

• A stable message passing library

• Talent availability to fill the gaps

• Willingness to win or fail

• Modest and well defined goals, vision, and path

Page 191: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

17

Dominance of Clusters in HPC

• Every major HPC vendor (but 1) has a

cluster product

– IBM

– HP

– SUN

– NEC

– Fujitsu

– SGI

– Cray

• Additional vendors dedicated to clusters

– Penguin

– Dell

Page 192: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

18

Page 193: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

19

Clusters Dominate Top-500

Page 194: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

20

Why are Clusters so Prevalent

• Excellent performance to cost for many workloads– Exploits economy of scale

• Mass produced device types

• Mainstream standalone subsystems

– Many competing vendors for similar products

• Just in place configuration– Scalable up and down

– Flexible in configuration

• Rapid tracking of technology advance– First to exploit newest component types

• Programmable– Uses industry standard programming languages and tools

• User empowerment• Low cost, ubiquitous systems

• Programming systems make it relatively easy to program for expert users

Page 195: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

21

1st printing: May, 1999

2nd printing: Aug. 1999

MIT Press

Page 196: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

22

Page 197: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

23

What You Need to Know about Clusters

• Key system elements

– SMP Node

– Interconnect Networks

– Operating Systems

– Resource Management / Scheduling systems

• Programming & Runtime environment

– Message-passing/Cooperative programming model

– Programming languages & compilers, debuggers

• Performance Measurement & Profiling

– How is performance affected

– How to measure how well the applications behave

– How to optimize application behavior

Page 198: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

24

Key Parameters for Cluster Computing

• Peak floating point performance

• Sustained floating point performance

• Main memory capacity

• Bi-section bandwidth

• I/O bandwidth

• Secondary storage capacity

• Organization– Processor architecture

– # processors per node

– # nodes

– Accelerators

– Network topology

• Logistical Issues– Power Consumption

– HVAC / Cooling

– Floor Space (Sq. Ft)

Page 199: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

25

Where’s the Parallelism

• Inter-node

– Multiple nodes

– Primary level for commodity clusters

– Secondary level for constellations

• Multi socket, intra-node

– Routinely 1, 2, 4, 8

– Heterogeneous computing with accelerators

• Multi-core, intra-socket

– 2, 4 cores per socket

• Multi-thread, intra-core

– None or two usually

• ILP, intra-core

– Multiple operations issued per instruction

• Out of order, reservation stations

• Prefetching

• Accelerators

Page 200: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

26

Cluster System

[Diagram] A cluster system: multiple SMP compute nodes (each with microprocessors and their L1/L2/L3 caches, memory banks M1 … Mn-1, a controller, storage, and NICs) joined by an interconnect network, fronted by login & cluster access nodes and a resource management & scheduling subsystem.

Page 201: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

27

Constituent Hardware Elements

• Compute Nodes (“nodes”)

– Standalone mainstream products

– Processors and accelerators

– Memory and caches

– Chip set

– Interfaces

• System Area Network(s)

– Network interface controllers (NIC)

– Switches

– Cables

• External I/O

– File system

– Internet access

– User interface

– Management and administration

Page 202: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

28

Page 203: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

29

Microprocessor Clock Rate

Page 204: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Technology Trends

30

Page 205: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

31

Compute Node Diagram

[Diagram] A compute node: pairs of microprocessors (MP), each with private L1 and L2 caches and a shared L3, connect through a controller to memory banks M0 … Mn-1, storage (S), and network interface cards (NIC) on Ethernet and PCI-e, plus USB peripherals and a JTAG port.

Legend : MP : MicroProcessor; L1, L2, L3 : Caches; M1.. : Memory Banks; S : Storage; NIC : Network Interface Card

Page 206: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Arete Node Picture

32

Page 207: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

33

Parameters for Cluster Nodes

• Processor architecture family (AMD Opteron, Intel Xeon, IBM Power)• Number of processor chips (2)• Number of processor cores per chip (multicore) (3-4)• Memory capacity per processor chip (2 GBytes per core)• Processor core clock rate (3)

– GHz

• Operations per instruction issue, ILP (2 – 4 floating point operations)• Cache size per core (L1, L2, L3)• Distributed or shared memory (SMP) structure

– Cache coherent?

• Number and class of network ports• Latency to main memory (100 – 400 cycles)

– Measured in processor clock cycles

• Disk spindles and capacity (0, 1, or 2)• Ancillary I/O ports• Packaging issues

– Power– Size (1 to 4 u) (http://en.wikipedia.org/wiki/Rack_unit)– Cost

Page 208: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

34

Page 209: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

35

The History of Linux

• Started out with Linus' frustration with the available affordable operating

systems for the PC

• He put together a rudimentary scheduler, and later added on more

features until he could bootstrap the kernel (1991).

• The source was released on the internet in hope that more people

would contribute to the kernel

• GCC was ported, a C library was added and a primitive serial and

tty driver code

• Networks, file systems were added

• Slackware

• RedHat

• Extreme Linux

Page 210: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

36

Open Source Software

• Evolution of PC Clusters has benefited from Open Source Software

• Early examples

– Gnu compiler tools, FreeBSD, Linux, PVM

• Advantages

– Provides shared infrastructure – avoids duplication of effort

– Permits wide collaborations

– Facilitates exploratory studies and innovation

• Free software is not necessarily OSS

• Business model in state of flux: how to fund free deliverables

• Important synergy between OSS standard infrastructure software and

proprietary ISV target-specific software:

– OSS provides common framework

– For-profit software provides incentive and resources

Page 211: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011 37

Linux DistributionsAlphanet Linux

Alzza Linux

Andrew Linux

Apokalypse

Armed Linux

ASPLinux

Bad Penguin

Bastille Linux

Best Linux

BlackCat Linux

Blue Linux

Bluecat Linux

BluePoint Linux

Brutalware

Caldera OpenLinux

Cclinux

ChainSaw Linux

CLEClIeNUX

Conectiva

CoolLinux

Coyote Linux

Corel

COX-Linux

Darkstar Linux

Debian Definite

Linux

deepLINUX

Delix

Dlite (Debian Lite)

DragonLinux

Eagle Linux M68K

easyLinux

Elfstone Linux

Embedix

Enoch

Eonova Linux

ESware

Etlinux

Eurielec Linux

FinnixFloppi Gentoo

Linux

Gentus Linux

Green Frog Linux

Halloween Linux

Hard Hat Linux

HispaFuentes

HVLinux

Icepack

Immunix

OSIndependence

InfoMagick Workgroup

Server

Ivrix

ix86 Linux

JBLinux

Jurix Linux

Kondara

Krud

KW Linux

KSI Linux

L13Plus

Laser5

Leetnux

Lightening

Linpus Linux

Linux Antarctica

Linux by Linux

Linux GT Server Edition

Linux Mandrake

Linux MX

LinuxOne

LinuxPPC

LinuxPPP

LinuxSIS

LinuxWare

Linux-YeS

LNX System

Lunet

LuteLinux

LST

Mastodon

MaxOS™

MIZI Linux OS

MkLinux

MNIS Linux

MicroLinux

Monkey Linux

NeoLinux

Newlix OfficeServer

NoMad Linux

Ocularis

Open Kernel Linux

Open Share Linux

OS2000

Peanut Linux

PhatLINUX

PingOO

Plamo Linux

Platinum Linux

Power Linux

Progeny Debian

Project Freesco

Prosa Debian

Pygmy Linux

Red Flag Linux

Red Hat Linux

Redmond Linux

Rock Linux

RT-Linux

Scrudge Ware

Secure Linux

Skygate Linux

Slacknet Linux

Slackware

Slinux

SOT Linux

Spiro

Stampede Linux

Storm Linux

S.u.SE

Thin Linux

TINY Linux

Trinux

Trustix Secure Linux

TurboLinux

Turquaz

UltraPenguin

Ute-Linux

VA-enhanced RedHat Linux

VectorLinux

Vedova Linux

Vine Linux

White Dwarf Linux

Whole Linux

WinLinux 2000

WorkGroup Solutions

Linux Pro Plus

Xdenu

Xpresso Linux 2000

XTeam Linux

Yellow Dog Linux

Yggdrasil Linux

ZiiF Linux

ZipHam

ZipSlack

Page 212: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

38

Operating System

• What is an Operating System?

– A program that controls the execution of application programs

– An interface between applications and hardware

• Primary functionality

– Exploits the hardware resources of one or more processors

– Provides a set of services to system users

– Manages secondary memory and I/O devices

• Objectives

– Convenience: Makes the computer more convenient to use

– Efficiency: Allows computer system resources to be used in an

efficient manner

– Ability to evolve: Permit effective development, testing, and

introduction of new system functions without interfering with service

Source: William Stallings “Operating Systems: Internals and Design Principles (5th Edition)”

Page 213: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

39

Services Provided by the OS

• Program development

– Editors and debuggers

• Program execution

• Access to I/O devices

• Controlled access to files

• System access

• Protection

• Error detection and response

– Internal and external hardware errors

– Software errors

– Operating system cannot grant request of application

• Accounting

Page 214: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

40

Layers of Computer System

Page 215: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

41

Resources Managed by the OS

• Processor

• Main Memory

– volatile

– referred to as real memory or primary memory

• I/O modules

– secondary memory devices

– communications equipment

– terminals

• System bus

– communication among processors, memory, and I/O modules

Page 216: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

42

Page 217: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

43

Page 218: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

44

Programming on Clusters

• Several ways of programming applications on clusters− Throughput – job stream

− Decoupled Work Queue Model – SPMD for parameter studies

− Communicating Sequential Processes (CSP)

− Multi threaded

• Throughput: job stream– PBS, Maui

• Decoupled Work Queue Model : SPMD, e.g. parametric studies– Condor

• Communicating Sequential Processes– Message passing

– Distributed memory

– Global barrier synchronization

– e.g., MPI

• Multi threaded– Limited to intra-node programming

– Shared memory

– e.g., OpenMP

Page 219: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Throughput Computing

• Simplest form of parallel computing

• Separate jobs on separate compute nodes

– Independent tasks on independent nodes

• No intra application / cross node communication

• “job stream” workflow

• Capacity computing

– Distinguished from cooperative and capability computing

– Scaling dependent on number of concurrent jobs

• Performance

– Throughput

– Total aggregate operations per second achieved

• Widely used for servers

45

Page 220: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Decoupled Work Queue Model

• Concurrent disjoint tasks

• Parametric Studies

– SPMD (single program multiple data)

• Very coarse grained

• Example software package : Condor

• Processor farms and clusters

• Throughput Computing Lecture covers this model of

parallelism in greater depth

46

Page 221: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

47

Page 222: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011 48

Some Node Interconnect Options

• Current Generation

– Gigabit Ethernet (~1000 Mb/s)

– 10 Gigabit Ethernet

– 40 Gigabit Ethernet and 100 Gigabit Ethernet (100GbE)

standards are in draft as of 2009

– Infiniband (IBA)

• Previous Generation

– Fast Ethernet (~100 Mb/s)

– Myricom’s Myrinet-2000 (~1600 Mb/s)

– SCI (~4000 Mb/s)

– OC-12 ATM (~622 Mb/s)

– Fiber Channel (~100 MB/s)

– USB (12 Mb/s)

– Firewire (IEEE 1394 400 Mb/s)

Page 223: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

49

Fast and Gigabit Ethernet

• Cost effective

• Lucent, 3com, Cisco, etc.

• Directly leverage LAN technology and market

• Up to 384 100 Mbps ports in one switch

• Switches can be stacked or connected with multiple gigabit links

• 100 Base-T:– Bandwidth: > 11 MB/s

– Latency: < 90 microseconds

• 1000 Base-T:– Bandwidth: ~ 50 MB/s

– Latency: < 90 microseconds
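A useful first-order model for comparing such interconnects (a standard textbook approximation, not something stated on the slide) is that moving an $m$-byte message costs about

$t(m) \approx \alpha + \dfrac{m}{\beta}$

where $\alpha$ is the per-message latency (here on the order of tens of microseconds) and $\beta$ the sustained bandwidth (here tens of MB/s); short messages are dominated by $\alpha$, bulk transfers by $\beta$.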

Page 224: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

50

Myrinet

• High Performance: 2+2 Gbps

• Low latency: 11 microseconds

• Fiber and copper interconnects

• High Availability – auto reroute

• 4, 8,16 and 64 port switches, stackable

• Scalable to 1000s of hosts

Page 225: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

InfiniBand

51

• High Performance: 10 - 20 Gbps

• Low latency: 1.2 microseconds

• Copper interconnects

• High availability - IEEE 802.3ad Link Aggregation / Channel Bonding

http://www.hpcwire.com/hpc/1342206.html

Page 226: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Network Interconnect Topologies

52

TORUS

FAT-TREE (CLOS)

Page 227: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

53

Dell PowerEdge SC1435

Opteron, IBA

Page 228: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

54

Example: 320-host Clos topology of

16-port switches

64 hosts 64 hosts 64 hosts 64 hosts 64 hosts

(From Myricom)

Page 229: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Arete Infiniband Network

55

Page 230: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

56

Page 231: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

57

Schedulers : PBS

Workload management system – coordinates resource utilization policy and user job requirements– Multi users, Multi jobs, Multi nodes

• Both Open Source and Commercially supported (Veridian)

• Functionality– Manages parallel job execution

– Interactive and batch cross system scheduling

– Security and access control lists

– Dynamic distribution and automatic load-leveling of workload

– Job and user accounting

• Accomplishments– Runs on all Unix and Linux platforms

– Supports MPI

– First release 1995

– 2000 sites registered, 1000 people on the mailing list

– PBSPro sales at >5000 cpu’s

Page 232: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

58

Schedulers : Maui (Moab)

• Cluster Resources Inc.

• Advanced systems software tool for more optimal job scheduling

• Improved administration and statistical reporting capabilities

• Analytical simulation capabilities to evaluate different allocation and prioritization schemes

• Offers different classes of services to users, allowing high-priority users to be scheduled first while preventing long-term starvation of low-priority jobs

• SMP Enabled

Page 233: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

59

Schedulers : Condor

• Distributed Task Scheduler

• Emphasis on throughput or capacity computing

• Services

– Automates cycle harvesting from workstation farms

– Distributed time-sharing and batch processing resource

– Exploits opportunistic versus dedicated resources

– Permits preemptive acquisition of resources

– Transparent checkpointing

– Remote I/O – preserves local execution environment (requires relinking)

– Asynchronous process management, master-worker processing

• Accomplishments

– First production system operational in 1986

– U. of Wisconsin: 1300 Condor-controlled CPUs on campus

– Used by:

• large software house for builds and testing,

• Xerox printer simulation,

• Core Digital Pictures rendering of movies,

• INFN for high energy physics,

• 250 machines at NAS, half million hours

Page 234: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

60

Page 235: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

61

MPI Software

• Community wide standard process

– Leveraged experiences with NX, PVM, P4, Zipcode, others

• Dominant programming model for clusters

• Multiple implementations both OSS and commercial (MPI Soft Tech)

– All of MPI-1

– MPI I/O

– All of MPI-2

– MPI-3 under development

• Functionality

– Message passing model for distributed memory platforms

– Support for truly scalable operations (1000s of nodes)

• Rich set of collective operations (gathers, reduces, scans, all-to-all)

• Scalable one-sided operations (fence barrier synchronization, group-oriented synchronization)

– Dynamic processes (MPI-2): spawn, disconnect, etc., with scalability

• MPICH-2 is an entirely new rewrite

• Open MPI includes fault-tolerance capability

Page 236: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

62

Page 237: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Compilers & Debuggers

• Compilers : – Intel C/ C++ / Fortran

– PGI C/ C++ / Fortran

– GNU C / C++ / Fortran

• Libraries :– Each compiler is linked against MPICH

– Mesh/Grid Partitioning software : METIS etc.

– Math Kernel Libraries (MKL)

– Intel MKL, AMD MKL, GNU Scientific Library (GSL)

– Data format libraries : NetCDF, HDF 5 etc

– Linear Algebra Packages : BLAS, LAPACK etc

• Debuggers– gdb

– Totalview

• Performance & Profiling tools (a usage sketch follows this slide): – PAPI

– TAU

– Gprof

– perfctr

63
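As a hedged sketch of how the listed compilers and profilers fit together, one common workflow is to compile with profiling instrumentation, run the job, and read the resulting flat profile with gprof; the program name and process count below are placeholders.

# Hypothetical example: profile an MPI program with gprof
mpicc -O2 -pg -o myapp myapp.c      # -pg adds gprof instrumentation at compile and link time
mpiexec -n 4 ./myapp                # each rank writes a gmon.out profile (ranks sharing a directory may overwrite it)
gprof ./myapp gmon.out | head -40   # flat profile: time spent per function

PAPI and TAU provide finer-grained hardware-counter and tracing data than this simple flat profile.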

Page 238: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Distributed File Systems

• A distributed file system is a file system that is stored locally on one system (server) but is accessible by processes on many systems (clients).

• Multiple processes access multiple files simultaneously.

• Other attributes of a DFS may include :

– Access control lists (ACLs)

– Client-side file replication

– Server- and client- side caching

• Some examples of DFSes:

– NFS (Sun)

– AFS (CMU)

– PVFS (Clemson, Argonne), OrangeFS

– Lustre (Sun)

– GPFS (IBM)

• Distributed file systems can be used by parallel programs, but they have significant disadvantages :

– The network bandwidth of the server system is a limiting factor on performance

– To retain UNIX-style file consistency, the DFS software must implement some form of locking which has significant performance implications

64

Ohio Supercomputer Center

Page 239: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Distributed File System : NFS

• Popular means for accessing remote file systems in a local area network.

• Based on the client-server model: the remote file systems are "mounted" via NFS and accessed through the Linux virtual file system (VFS) layer (a typical client-side mount is sketched after this slide).

• NFS clients cache file data, periodically checking with the original file for any changes.

• The loosely-synchronous model makes for convenient, low-latency access to shared spaces.

• NFS avoids the common locking systems used to implement POSIX semantics.

65
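For illustration only, making an NFS export visible on a client node typically looks like the following (it requires root privileges); the server name and paths are placeholders, and production clusters normally configure this in /etc/fstab or through an automounter rather than by hand.

# Hypothetical example: mount a home file system exported by an NFS server
mount -t nfs headnode:/export/home /home    # kernel NFS client; the VFS layer presents it like a local file system
df -h /home                                 # confirm the remote file system is mounted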

Page 240: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

66

Parallel Virtual File System (PVFS)

• Clemson University - 1993

• Objective: high throughput file system – DOE, NASA, (GPL)

• Strategy:

– exploit parallelism of bandwidth

– provide user interface so that applications can make powerful requests such as large collection of non-contiguous data with single request for multidimensional data sets,

– allow application direct access to server:

• multiple application tasks directly access/spawn multiple file servers without going through kernel or central mechanism.

• N-clients and N-servers

• Single file spread across multiple disks and nodes and accessed by multiple tasks in an application.

• Scaling facilitated by eliminating single bottleneck

• Actual distribution of a file is configurable on a file by file basis.

• Reactive scheduling addresses the problem of network contention and adapts to file system load

Page 241: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

67

Page 242: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Measuring Performance on Clusters

• Ways of measuring performance (a wall-clock timing sketch follows this slide): – Wall clock time

– Benchmarks

– Processor efficiency factors

– Scalability

– MPI communications and synchronization overhead

– System operations

• Tools– PAPI

– Tau

– Ganglia

– Many others

68
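As a minimal sketch of the first metric above, wall clock time for a code region can be measured directly inside an MPI program with MPI_Wtime; the timed region here is a placeholder.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();                    /* wall-clock timestamp in seconds */
    /* ... region of interest: computation and communication ... */
    t1 = MPI_Wtime();

    printf("rank %d: elapsed %f seconds\n", rank, t1 - t0);

    MPI_Finalize();
    return 0;
}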

Page 243: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

MPI Performance Measurement : VAMPIR

69

src : http://mumps.enseeiht.fr/

Page 244: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

MPI Performance : Tau

70

src : http://www.cs.uoregon.edu/research/tau

Page 245: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

71

Page 246: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

72

Summary – Material for the Test

• What is a commodity cluster – slide 4

• Commodity clusters vs “Constellations” – slide 8

• Key parameters for cluster computing – slide 24

• Where is the parallelism – slide 25

• Parameters for cluster nodes – slide 33

• Node operating system – slide 38,39,40,41

• Programming clusters – slide 44

• Throughput computing – slide 45

• Decoupled work queue model – slide 46

• Interconnect options – slide 48

• Scheduling systems – slide 57, 58, 59

• Message passing : MPI software – slide 61

• Distributed file systems – slide 64

• Measuring performance on cluster: Metrics & Tools – slide 68

Page 247: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

73

Page 248: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS

CAPACITY COMPUTING

Prof. Thomas Sterling, Department of Computer Science, Louisiana State University, February 1, 2011

Page 249: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

2

Page 250: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

3

Page 251: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Key Terms and Concepts

4

[Figure: a problem represented as a series of instructions executed by a single CPU (serial), versus a problem partitioned into tasks, each a stream of instructions, executed by multiple CPUs (parallel)]

Conventional serial execution (also sequential execution): the problem is represented as a series of instructions that are executed by the CPU.

Parallel execution of a problem involves partitioning of the problem into multiple executable parts that are mutually exclusive and collectively exhaustive, represented as a partially ordered set exhibiting concurrency.

Parallel computing takes advantage of concurrency to:
• Solve larger problems within bounded time
• Save on wall clock time
• Overcome memory constraints
• Utilize non-local resources

Page 252: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Key Terms and Concepts

• Scalable Speedup : Relative reduction of execution time of a fixed size workload through parallel execution

• Scalable Efficiency : Ratio of the actual performance to the best possible performance.

5

$\text{Speedup} = \dfrac{\text{execution time on one processor}}{\text{execution time on } N \text{ processors}}$

$\text{Efficiency} = \dfrac{\text{execution time on one processor}}{\text{execution time on } N \text{ processors} \times N}$
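A small worked example of these two definitions, with made-up timings: if a fixed workload takes 100 s on one processor and 20 s on 8 processors, then

$\text{Speedup} = \dfrac{100\ \text{s}}{20\ \text{s}} = 5, \qquad \text{Efficiency} = \dfrac{100\ \text{s}}{20\ \text{s} \times 8} = 0.625$

so the parallel run uses the 8 processors at 62.5% of their ideal effectiveness.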

Page 253: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

6

Page 254: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Defining the 3 C’s … •  Main Classes of computing :

–  High capacity parallel computing : A strategy for employing distributed computing resources to achieve high throughput processing among decoupled tasks. Aggregate performance of the total system is high if sufficient tasks are available to be carried out concurrently on all separate processing elements. No single task is accelerated. Uses increased workload size of multiple tasks with increased system scale.

–  High capability parallel computing : A strategy for employing tightly coupled structures of computing resources to achieve reduced execution time of a given application through partitioning into concurrently executable tasks. Uses fixed workload size with increased system scale.

–  Cooperative computing : A strategy for employing moderately coupled ensemble of computing resources to increase size of the data set of a user application while limiting its execution time. Uses a workload of a single task of increased data set size with increased system scale.

7

Page 255: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Strong Scaling Vs. Weak Scaling

8  

[Figure: work per task vs. machine scale (# of nodes: 1, 2, 4, 8), contrasting weak scaling and strong scaling]

Page 256: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Strong Scaling, Weak Scaling

9  

[Figure: total problem size and granularity (size / node) vs. machine scale (# of nodes), comparing strong scaling and weak scaling]

Page 257: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Defining the 3 C’s … •  High capacity computing systems emphasize the

overall work performed over a fixed time period. Work is defined as the aggregate amount of computation performed across all functional units, all threads, all cores, all chips, all coprocessors and network interface cards in the system.

•  High capability computing systems emphasize improvement (reduction) in execution time of a single user application program of fixed data set size.

•  Cooperative computing systems emphasize single application weak scaling –  Performance increase through increase in problem size

(usually data set size and # of task partitions) with increase in system scale

10

Adapted from : High-performance throughput computing S Chaudhry, P Caprioli, S Yip, M Tremblay - IEEE Micro, 2005 - doi.ieeecomputersociety.org

Page 258: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Strong Scaling, Weak Scaling

11

[Figure: classification of capability, cooperative, and capacity computing along the strong scaling / weak scaling and single job / workload size scaling axes]

• Capability
– Primary scaling is decrease in response time proportional to increase in resources applied
– Single job, constant size – goal: response-time scaling proportional to machine size
– Tightly-coupled concurrent tasks making up a single job

• Cooperative
– Single job (different nodes working on different partitions of the same job)
– Job size scales proportional to machine
– Granularity per node is fixed over range of system scale
– Loosely coupled concurrent tasks making up a single job

• Capacity
– Primary scaling is increase in throughput proportional to increase in resources applied
– Decoupled concurrent tasks, each a separate job, increasing in number of instances, scaling proportional to machine

Page 259: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

12

Page 260: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Models of Parallel Processing •  Conventional models of parallel processing

–  Decoupled Work Queue (covered in segment 1 of the course) –  Communicating Sequential Processing (CSP message passing)

(covered in segment 2) –  Shared memory multiple thread (covered in segment 3)

•  Some alternative models of parallel processing –  SIMD

•  Single instruction stream multiple data stream processor array –  Vector Machines

•  Hardware execution of value sequences to exploit pipelining –  Systolic

•  An interconnection of basic arithmetic units to match algorithm –  Data Flow

•  Data precedent constraint self-synchronizing fine grain execution units supporting functional (single assignment) execution

13

Page 261: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Shared memory multiple Thread

•  Static or dynamic •  Fine Grained •  OpenMP •  Distributed shared memory systems •  Covered in Segment 3

14

[Figure: Symmetric Multi Processor (SMP, usually cache coherent) and Distributed Shared Memory (DSM, usually cache coherent): CPUs and memories connected by a network; photo: Orion, JPL NASA]

Page 262: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Communicating Sequential Processes

•  One process is assigned to each processor

•  Work done by the processor is performed on the local data

•  Data values are exchanged by messages

•  Synchronization constructs for inter process coordination

•  Distributed Memory •  Coarse Grained •  MPI application programming interface •  Commodity clusters and MPP

–  MPP is acronym for “Massively Parallel Processor”

•  Covered in Segment 2

15

[Figure: Distributed Memory (DM, often not cache coherent): CPUs, each with private memory, connected by a network; photo: QueenBee]

Page 263: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Decoupled Work Queue Model

•  Concurrent disjoint tasks –  Job stream parallelism –  Parametric Studies

•  SPMD (single program multiple data)

•  Very coarse grained •  Example software package : Condor •  Processor farms and commodity clusters •  This lecture covers this model of parallelism

16

Page 264: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

17

Page 265: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Ideal Speedup Issues

18

•  W is total workload measured in elemental pieces of work (e.g. operations, instructions, subtasks, tasks, etc.)

•  T(p) is total execution time measured in elemental time steps (e.g. clock cycles) where p is # of execution sites (e.g. processors, threads)

• wi is work for a given task i, measured in operations

• Example: here we divide a million-operation (really mega, 2^20) workload, W, into a thousand (really 1024) tasks, w1 to w1024, each of 1 K operations

• Assume 256 processors performing workload in parallel

• T(256) = 4096 steps, speedup = 256, Eff = 1

Page 266: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Ideal Speedup Example

19

[Figure: workload $W = 2^{20}$ operations split into $2^{10}$ tasks $w_1 \dots w_{2^{10}}$ of $2^{10}$ operations each, executed on $P = 2^8 = 256$ processors; units are steps]

$W = \sum_i w_i$, \quad $T(1) = 2^{20}$, \quad $T(2^8) = 2^{12}$

$\text{Speedup} = \dfrac{2^{20}}{2^{12}} = 2^8$

$\text{Efficiency} = \dfrac{2^{20}}{2^{12} \times 2^8} = 2^0 = 1$

Page 267: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Granularities in Parallelism Overhead

•  The additional work that needs to be performed in order to manage the parallel resources and concurrent abstract tasks that is in the critical time path.

Coarse Grained •  Decompose problem into large independent

tasks. Usually there is no communication between the tasks. Also defined as a class of parallelism where: “relatively large amounts of computational work is done between communication”

Fine Grained •  Decompose problem into smaller inter-

dependent tasks. Usually these tasks are communication intensive. Also defined as a class of parallelism where: “relatively small amounts of computational work are done between communication events” –www.llnl.gov/computing/tutorials/parallel_comp

20

Images adapted from : http://www.mhpcc.edu/training/workshop/parallel_intro/

[Figure: proportion of overhead vs. computation per task for coarse-grained and finely-grained decompositions]

Page 268: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Overhead

21

•  Overhead: Additional critical path (in time) work required to manage parallel resources and concurrent tasks that would not be necessary for purely sequential execution

•  V is total overhead of workload execution •  vi is overhead for individual task wi

•  Each task takes vi +wi time steps to complete •  Overhead imposes upper bound on scalability

Page 269: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Overhead

22

[Figure: P tasks, each an overhead segment v followed by a work segment w; for four tasks, V + W = 4v + 4w]

Assumption: workload is infinitely divisible.

$W = \sum_{i=1}^{P} w_i$, \quad $w_i = \dfrac{W}{P}$, \quad $T_P = v + \dfrac{W}{P}$

$S = \dfrac{T_1}{T_P} = \dfrac{W + v}{\frac{W}{P} + v} \approx \dfrac{W}{\frac{W}{P} + v} = \dfrac{P}{1 + \frac{Pv}{W}} = \dfrac{P}{1 + \frac{v}{W/P}}$

v = overhead, V = total overhead, w = work unit, W = total work, $T_i$ = execution time with i processors, P = # processors

Page 270: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Scalability and Overhead for fixed sized work tasks

23

• W is divided into J tasks of size wg

• Each task requires v overhead work to manage

• For P processors there are approximately J/P tasks to be performed in sequence, so

• TP is J(wg + v)/P

• Note that S = T1 / TP

• So, S = P / (1 + v / wg) (a worked set of numbers follows this slide)

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Scalability & Overhead

24

J = # tasks = Wwg

!

"""

#

$$$%Wwg

T1 =W + v %W

TP =JP& wg + v( ) = W

Pwg

& (wg + v) =WP1+ v

wg

'

())

*

+,,

TP =WP1+ v

wg

'

())

*

+,,

S = T1TP

-W

WP1+ v

wg

'

())

*

+,,

=P

1+ vwg

when W >> v

v  =  overhead  wg  =  work  unit  W  =  Total  work  Ti  =  execu1on  1me  with  i  processors  P  =  #  Processors  J  =  #  Tasks  

Page 272: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

25

Page 273: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Capacity Computing with basic Unix tools

•  Combination of common Unix utilities such as ssh, scp, rsh, rcp can be used to remotely create jobs (to get more information about these commands try man ssh, man scp, man rsh, man rcp on any Unix shell)

• For small workloads it can be convenient to wrap the execution of the program in a simple shell script (a minimal sketch follows this slide).

• Relying on simple Unix utilities poses several application management constraints for cases such as:
– Aborting started jobs
– Querying for free machines
– Querying for job status
– Retrieving job results
– etc.

26
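A minimal sketch of this approach, assuming password-less ssh to the listed machines and a shared file system; the host names, application, and parameters are placeholders.

#!/bin/bash
# Hypothetical decoupled work queue built from plain ssh: one independent task per host
HOSTS="node01 node02 node03 node04"          # placeholder machine names
PARAM=1
for h in $HOSTS; do
    # launch one task per host in the background; each task writes its own output file
    ssh "$h" "cd $PWD && ./my_task $PARAM > result.$PARAM" &
    PARAM=$((PARAM + 1))
done
wait                                          # block until every remote task has finished
echo "all tasks completed"

Even this tiny script illustrates the constraints listed above: it offers no help with aborting started jobs, querying machine or job status, or collecting results beyond the shared file system.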

Page 274: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

BOINC, SETI@Home

• BOINC (Berkeley Open Infrastructure for Network Computing)
• Open-source software that enables distributed coarse-grained computations over the internet.
• Follows the Master-Worker model; in BOINC no communication takes place among the worker nodes
• SETI@Home
• Einstein@Home
• Climate prediction
• And many more…

27

Page 275: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

28

Page 276: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Management Middleware : Condor

• Designed, developed and maintained at the University of Wisconsin-Madison by a team led by Miron Livny

• Condor is a versatile workload management system for managing a pool of distributed computing resources to provide high capacity computing.

•  Assists distributed job management by providing mechanisms for job queuing, scheduling, priority management, tools that facilitate utilization of resources across Condor pools

•  Condor also enables resource management by providing monitoring utilities, authentication & authorization mechanisms, condor pool management utilities and support for Grid Computing middleware such as Globus.

•  Condor Components •  ClassAds •  Matchmaker •  Problem Solvers

29

Page 277: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Condor Components : Class Ads •  ClassAds (Classified Advertisements) concept is

very similar to the newspaper classifieds concepts where buyers and sellers advertise their products using abstract yet uniquely defining named expressions. Example : Used Car Sales

•  ClassAds language in Condor provides well defined means of describing the User Job and the end resources ( storage / computational ) so that the Condor MatchMaker can match the job with the appropriate pool of resources.

Management Middleware : Condor

Src : Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience" Concurrency and Computation: Practice and

Experience, Vol. 17, No. 2-4, pages 323-356, February-April, 2005. http://www.cs.wisc.edu/condor/doc/condor-practice.pdf

30

Page 278: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Job ClassAd & Machine ClassAd

31  

Page 279: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Condor MatchMaker •  MatchMaker, a crucial part of the Condor

architecture, uses the job description classAd provided by the user and matches the Job to the best resource based on the Machine description classAd

•  MatchMaking in Condor is performed in 4 steps : 1.  Job Agent (A) and resources (R) advertise themselves. 2.  Matchmaker (M) processes the known classAds and

generates pairs that best match resources and jobs 3.  Matchmaker informs each party of the job-resource pair of

their prospective match. 4.  The Job agent and resource establish connection for further

processing. (Matchmaker plays no role in this step, thus ensuring separation between selection of resources and subsequent activities)

Management Middleware : Condor

Src : Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience" Concurrency and

Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April, 2005.

http://www.cs.wisc.edu/condor/doc/condor-practice.pdf

32

Page 280: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

33

Page 281: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Condor Problem Solvers •  Master-Worker (MW) is a problem solving system that is

useful for solving a coarse grained problem of indeterminate size such as parameter sweep etc.

•  The MW Solver in Condor consists of 3 main components : work-list, a tracking module, and a steering module. The work-list keeps track of all pending work that master needs done. The tracking module monitors progress of work currently in progress on the worker nodes. The steering module directs computation based on results gathered and the pending work-list and communicates with the matchmaker to obtain additional worker processes.

•  DAGMan is used to execute multiple jobs that have dependencies represented as a Directed Acyclic Graph where the nodes correspond to the jobs and edges correspond to the dependencies between the jobs. DAGMan provides various functionalities for job monitoring and fault tolerance via creation of rescue DAGs.

Management Middleware : Condor

34

[Figure: a master process coordinating worker processes w1 … wN]

Page 282: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Management Middleware : Condor

In-depth coverage: http://www.cs.wisc.edu/condor/publications.html

Recommended reading: Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience", Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April 2005. [PDF]

Todd Tannenbaum, Derek Wright, Karen Miller, and Miron Livny, "Condor - A Distributed Job Scheduler", in Thomas Sterling, editor, Beowulf Cluster Computing with Linux, The MIT Press, 2002. ISBN: 0-262-69274-0 [Postscript] [PDF]

35

Page 283: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Core components of Condor •  condor_master: This program runs constantly and ensures that all other parts of Condor

are running. If they hang or crash, it restarts them. •  condor_collector: This program is part of the Condor central manager. It collects

information about all computers in the pool as well as which users want to run jobs. It is what normally responds to the condor_status command. It's not running on your computer, but on the main Condor pool host (Arete head node).

•  condor_negotiator: This program is part of the Condor central manager. It decides what jobs should be run where. It's not running on your computer, but on the main Condor pool host (Arete head node).

•  condor_startd: If this program is running, it allows jobs to be started up on this computer--that is, Arete is an "execute machine". This advertises Arete to the central manager (more on that later) so that it knows about this computer. It will start up the jobs that run.

•  condor_schedd If this program is running, it allows jobs to be submitted from this computer--that is, desktron is a "submit machine". This will advertise jobs to the central manager so that it knows about them. It will contact a condor_startd on other execute machines for each job that needs to be started.

•  condor_shadow For each job that has been submitted from this computer (e.g., desktron), there is one condor_shadow running. It will watch over the job as it runs remotely. In some cases it will provide some assistance. You may or may not see any condor_shadow processes running, depending on what is happening on the computer when you try it out.

36

Source : http://www.cs.wisc.edu/condor/tutorials/cw2005-condor/intro.html

Page 284: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Condor : A Walkthrough of Condor commands

condor_status : provides current pool status
condor_q : provides current job queue
condor_submit : submit a job to condor pool
condor_rm : delete a job from job queue

37

Page 285: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

What machines are available ? (condor_status)

condor_status queries resource information sources and provides the current status of the condor pool of resources

38

§ Some common condor_status command line options:
§ -help : displays usage information
§ -avail : queries condor_startd ads and prints information about available resources
§ -claimed : queries condor_startd ads and prints information about claimed resources
§ -ckptsrvr : queries condor_ckpt_server ads and displays checkpoint server attributes
§ -pool hostname : queries the specified central manager (by default queries $COLLECTOR_HOST)
§ -verbose : displays entire classads
§ For more options and what they do run "condor_status -help"

Page 286: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

condor_status : Resource States

•  Owner : The machine is currently being utilized by a user. The machine is currently unavailable for jobs submitted by condor until the current user job completes.

•  Claimed : Condor has selected the machine for use by other users.

•  Unclaimed : Machine is unused and is available for selection by condor.

•  Matched : Machine is in a transition state between unclaimed and claimed

•  Preempting : Machine is currently vacating the resource to make it available to condor.

39

Page 287: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Example : condor_status

40

[cdekate@celeritas ~]$ condor_status
Name           OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime
vm1@compute-0  LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:42:23
vm2@compute-0  LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:42:24
vm3@compute-0  LINUX  X86_64  Unclaimed  Idle      0.010   1964  0+00:45:06
vm4@compute-0  LINUX  X86_64  Owner      Idle      1.000   1964  0+00:00:07
vm1@compute-0  LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:42:25
vm2@compute-0  LINUX  X86_64  Unclaimed  Idle      0.000   1964  1+09:05:58
vm3@compute-0  LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:37:27
vm4@compute-0  LINUX  X86_64  Unclaimed  Idle      0.000   1964  0+00:05:07
…
vm3@compute-0  LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:42:33
vm4@compute-0  LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:42:34

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
X86_64/LINUX     32      3        0         29        0           0         0
       Total     32      3        0         29        0           0         0

Page 288: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

What jobs are currently in the queue? condor_q

•  condor_q provides a list of job that have been submitted to the Condor pool

•  Provides details about jobs including which cluster the job is running on, owner of the job, memory consumption, the name of the executable being processed, current state of the job, when the job was submitted and how long has the job been running.

41

§ Some common condor_q command line options:
§ -global : queries all job queues in the pool
§ -name : queries based on the schedd name; provides a queue listing of the named schedd
§ -claimed : queries condor_startd ads and prints information about claimed resources
§ -goodput : displays job goodput statistics ("goodput is the allocation time when an application uses a remote workstation to make forward progress." – Condor Manual)
§ -cputime : displays the remote CPU time accumulated by the job to date...
§ For more options run: "condor_q -help"

Page 289: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

[cdekate@celeritas ~]$ condor_q
-- Submitter: celeritas.cct.lsu.edu : <130.39.128.68:40472> : celeritas.cct.lsu.edu
 ID    OWNER    SUBMITTED     RUN_TIME   ST PRI SIZE CMD
30.0   cdekate  1/23 07:52    0+00:01:13 R  0   9.8  fib 100
30.1   cdekate  1/23 07:52    0+00:01:09 R  0   9.8  fib 100
30.2   cdekate  1/23 07:52    0+00:01:07 R  0   9.8  fib 100
30.3   cdekate  1/23 07:52    0+00:01:11 R  0   9.8  fib 100
30.4   cdekate  1/23 07:52    0+00:01:05 R  0   9.8  fib 100

5 jobs; 0 idle, 5 running, 0 held
[cdekate@celeritas ~]$

42

Example  :  condor_q  

Page 290: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

How to submit your Job ? condor_submit

•  Create a job classAd (condor submit file) that contains Condor keywords and user configured values for the keywords.

• Submit the job classAd using "condor_submit"
• Example: condor_submit matrix.submit
• condor_submit -h provides additional flags

43

[cdekate@celeritas NPB3.2-MPI]$ condor_submit -h
Usage: condor_submit [options] [cmdfile]
  Valid options:
  -verbose              verbose output
  -name <name>          submit to the specified schedd
  -remote <name>        submit to the specified remote schedd (implies -spool)
  -append <line>        add line to submit file before processing (overrides submit file; multiple -a lines ok)
  -disable              disable file permission checks
  -spool                spool all files to the schedd
  -password <password>  specify password to MyProxy server
  -pool <host>          Use host as the central manager to query
  If [cmdfile] is omitted, input is read from stdin

Page 291: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

condor_submit : Example

44

[cdekate@celeritas ~]$ condor_submit fib.submit
Submitting job(s).....
Logging submit event(s).....
5 job(s) submitted to cluster 35.
[cdekate@celeritas ~]$ condor_q
-- Submitter: celeritas.cct.lsu.edu : <130.39.128.68:51675> : celeritas.cct.lsu.edu
 ID    OWNER    SUBMITTED     RUN_TIME   ST PRI SIZE CMD
35.0   cdekate  1/24 15:06    0+00:00:00 I  0   9.8  fib 10
35.1   cdekate  1/24 15:06    0+00:00:00 I  0   9.8  fib 15
35.2   cdekate  1/24 15:06    0+00:00:00 I  0   9.8  fib 20
35.3   cdekate  1/24 15:06    0+00:00:00 I  0   9.8  fib 25
35.4   cdekate  1/24 15:06    0+00:00:00 I  0   9.8  fib 30

5 jobs; 5 idle, 0 running, 0 held
[cdekate@celeritas ~]$

Page 292: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

How to delete a submitted job ? condor_rm

•  condor_rm : Deletes one or more jobs from Condor job pool. If a particular Condor pool is specified as one of the arguments then the condor_schedd matching the specification is contacted for job deletion, else the local condor_schedd is contacted.

45

[cdekate@celeritas ~]$ condor_rm -h
Usage: condor_rm [options] [constraints]
 where [options] is zero or more of:
  -help               Display this message and exit
  -version            Display version information and exit
  -name schedd_name   Connect to the given schedd
  -pool hostname      Use the given central manager to find daemons
  -addr <ip:port>     Connect directly to the given "sinful string"
  -reason reason      Use the given RemoveReason
  -forcex             Force the immediate local removal of jobs in the X state (only affects jobs already being removed)
 and where [constraints] is one or more of:
  cluster.proc        Remove the given job
  cluster             Remove the given cluster of jobs
  user                Remove all jobs owned by user
  -constraint expr    Remove all jobs matching the boolean expression
  -all                Remove all jobs (cannot be used with other constraints)
[cdekate@celeritas ~]$

Page 293: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

[cdekate@celeritas ~]$ condor_q
-- Submitter: celeritas.cct.lsu.edu : <130.39.128.68:51675> : celeritas.cct.lsu.edu
 ID    OWNER    SUBMITTED     RUN_TIME   ST PRI SIZE CMD
41.0   cdekate  1/24 15:43    0+00:00:03 R  0   9.8  fib 100
41.1   cdekate  1/24 15:43    0+00:00:01 R  0   9.8  fib 150
41.2   cdekate  1/24 15:43    0+00:00:00 R  0   9.8  fib 200
41.3   cdekate  1/24 15:43    0+00:00:00 R  0   9.8  fib 250
41.4   cdekate  1/24 15:43    0+00:00:00 R  0   9.8  fib 300

5 jobs; 0 idle, 5 running, 0 held
[cdekate@celeritas ~]$ condor_rm 41.4
Job 41.4 marked for removal
[cdekate@celeritas ~]$ condor_rm 41
Cluster 41 has been marked for removal.
[cdekate@celeritas ~]$

46

condor_rm : Example

Page 294: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Creating Condor submit file ( Job a ClassAd )

•  Condor submit file contains key-value pairs that help describe the application to condor.

• Condor submit files are job ClassAds.

• Some of the common descriptions found in the job ClassAds are listed below (a filled-in example follows this slide):

47

executable = (path to the executable to run on Condor)
input = (standard input provided as a file)
output = (standard output stored in a file)
log = (output to log file)
arguments = (arguments to be supplied to the queue)
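Filling those keywords in, a small hypothetical submit description might look like the following; the executable, argument, and file names are examples only, and $(Process) expands to 0..4 so the queued instances do not overwrite each other's output.

# fib.submit : hypothetical Condor submit description
universe    = vanilla
executable  = fib
arguments   = 100
input       = /dev/null
output      = fib.$(Process).out
log         = fib.log
queue 5

Submitting it with "condor_submit fib.submit" queues five independent instances of the job, as in the condor_submit example shown earlier.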

Page 295: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

DEMO : Steps involved in running a job on Condor.

1.  Creating a Condor submit file 2.  Submitting the Condor submit file to a Condor pool 3.  Checking the current state of a submitted job 4.  Job status Notification

48

Page 296: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Condor Usage Statistics

49

Page 297: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Montage workload implemented and executed using Condor ( Source : Dr. Dan Katz )

• Mosaicking astronomical images:
• Powerful telescopes taking high resolution (and highest zoom) pictures of the sky can cover a small region over time
• The problem being solved in this project is "stitching" these images together to make a high-resolution, zoomed-in snapshot of the sky
• Aggregate requirements of 140000 CPU hours (~16 years on a single machine), with output on the order of 6 TeraBytes

50

[Figure: example DAG for 10 input files; Montage compute nodes (mProject, mDiff, mFitPlane, mConcatFit, mBgModel, mBackground, mAdd) together with data stage-in, data stage-out, and registration nodes. Pegasus maps the abstract workflow to an executable form, using Grid Information Systems (information about available resources, data location) and MyProxy (user's grid credentials); Condor DAGMan executes the workflow on the Grid. http://pegasus.isi.edu/]

Page 298: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Montage Use By IPHAS: The INT/WFC Photometric H-alpha Survey of the Northern Galactic Plane

(Source : Dr. Dan Katz)

[Images: Supernova remnant S147; nebulosity in vicinity of HII region IC 1396B, in Cepheus; Crescent Nebula NGC 6888]

Study extreme phases of stellar evolution that involve very large mass loss

51  

Page 299: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

52

Page 300: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

53

• Throughput computing

• Performance measured as total workload performed over time to complete

• Overhead factors
– Start up time
– Input data distribution
– Output result data collection
– Terminate time
– Inter-task coordination overhead (no task coupling)

• Starvation
– Insufficient work to keep all processors busy
– Inadequate parallelism of coarse grained task parallelism
– Poor or uneven load distribution

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

54

Page 302: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Summary : Material for the Test •  Key terms & Concepts (4,5,7,8,9,10,11) •  Decoupled work-queue model (16) •  Ideal speedup (18,19) •  Overhead and Scalability (20,21,22,23,24) •  Understand Condor concepts detailed in slides (30,

31,32, 34,35, 36,37) •  Capacity computing performance issues (53) •  Required reading materials :

–  http://www.cct.lsu.edu/~cdekate/7600/beowulf-chapter-rev1.pdf –  Specific pages to focus on : 3-16

55

Page 303: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

56  

Page 304: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &

MEANS

MESSAGE PASSING INTERFACE MPI

(PART A)

Prof. Thomas Sterling, Department of Computer Science, Louisiana State University, February 8, 2011

Page 305: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

2

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 306: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

3

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 307: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Opening Remarks

• Context: distributed memory parallel computers

• We have communicating sequential processes, each with their own memory, and no access to another process's memory

– A fairly common scenario from the mid 1980s (Intel Hypercube) to today

– Processes interact (exchange data, synchronize) through message passing

– Initially, each computer vendor had its own library and calls

– First standardization was PVM

• Started in 1989, first public release in 1991

• Worked well on distributed machines

• Next was MPI

4

Page 308: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

What you'll Need to Know

• What is a standard API

• How to build and run an MPI-1 program

• Basic MPI functions

– 4 basic environment functions

• Including the idea of communicators

– Basic point-to-point functions

• Blocking and non-blocking

• Deadlock and how to avoid it

• Data types

– Basic collective functions

• The advanced MPI-1 material may be required for the

problem set

• The MPI-2 highlights are just for information

5

Page 309: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

6

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 310: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI Standard

• From 1992-1994, a community representing both vendors and users decided to create a standard interface to message passing calls in the context of distributed memory parallel computers (MPPs, there weren't really clusters yet)

• MPI-1 was the result

– "Just" an API

– FORTRAN77 and C bindings

– Reference implementation (mpich) also developed

– Vendors also kept their own internals (behind the API)

7

Page 311: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI Standard

• Since then– MPI-1.1

• Fixed bugs, clarified issues

– MPI-2

• Included MPI-1.2

– Fixed more bugs, clarified more issues

• Extended MPI

– New datatype constructors, language interoperability

• New functionality

– One-sided communication

– MPI I/O

– Dynamic processes

• FORTRAN90 and C++ bindings

• Best MPI reference– MPI Standard - on-line at: http://www.mpi-forum.org/

8

Page 312: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

9

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 313: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI : Basics

• Every MPI program must contain the preprocessor directive

• The mpi.h file contains the definitions and declarations necessary for

compiling an MPI program.

• mpi.h is usually found in the “include” directory of most MPI

installations. For example on arete:

10

#include "mpi.h"

...

#include "mpi.h"
...
MPI_Init(&argc, &argv);

...

...

MPI_Finalize();

...

Page 314: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

11

MPI: Initializing MPI Environment

Function: MPI_init()

int MPI_Init(int *argc, char ***argv)

Description:Initializes the MPI execution environment. MPI_init() must be called before any other MPI functions can be called and it should be called only once. It allows systems to do any special setup so that MPI Library can be used. argc is a pointer to the number of arguments and argv is a pointer to the argument vector. On exit from this routine, all processes will have a copy of the argument list.

...

#include "mpi.h"

...

MPI_Init(&argc,&argv);...

...

MPI_Finalize();

...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Init.html

Page 315: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

12

MPI: Terminating MPI Environment

Function: MPI_Finalize()

int MPI_Finalize()

Description: Terminates the MPI execution environment. All MPI processes must call this routine before exiting. MPI_Finalize() need not be the last executable statement or even in main; it must be called at some point following the last call to any other MPI function.

...

#include "mpi.h"

...

MPI_Init(&argc,&argv);

...

...

MPI_Finalize();...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Finalize.html

Page 316: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI Hello World

• C source file for a simple MPI Hello World

13

#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    MPI_Init( &argc, &argv );
    printf( "Hello, World!\n" );
    MPI_Finalize();
    return 0;
}

Include header files

Initialize MPI Context

Finalize MPI Context

Page 317: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Building an MPI Executable

• Library version

– User knows where header file and library are, and tells compiler

gcc -Iheaderdir -Llibdir mpicode.c -lmpich

• Wrapper version

– Does the same thing, but hides the details from the user

mpicc -o executable mpicode.c

You can do either one, but don't try to do both!

– use "sh -x mpicc -o executable mpicode.c" to figure out the gcc line

For our “Hello World” example on arete use:

mpicc -o hello hello.c

14

gcc -m64 -O2 -fPIC -Wl,-z,noexecstack -o hello hello.c -I/usr/include/mpich2-x86_64 -L/usr/lib64/mpich2/lib -L/usr/lib64/mpich2/lib -Wl,-rpath,/usr/lib64/mpich2/lib -lmpich -lopa -lpthread -lrt

OR

Page 318: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Running an MPI Executable

• Some number of processes are started somewhere

– Again, the standard doesn't talk about this

– Implementation and interface varies

– Usually, some sort of mpiexec command starts some number of copies

of an executable according to a mapping

– Example: 'mpiexec -n 2 ./a.out' runs two copies of ./a.out, with the number of processes specified as 2

– Most production supercomputing resources wrap the mpiexec command with

higher level scripts that interact with scheduling systems such as PBS /

LoadLeveler for efficient resource management and multi-user support

– Sample PBS / LoadLeveler job submission scripts:

PBS File:
#!/bin/bash
#PBS -l walltime=120:00:00,nodes=8:ppn=4
cd /home/cdekate/S1_L2_Demos/adc/
pwd
date
PROCS=`wc -l < $PBS_NODEFILE`
mpdboot --file=$PBS_NODEFILE
mpiexec -n $PROCS ./padcirc
mpdallexit
date

LoadLeveler File:
#!/bin/bash
#@ job_type = parallel
#@ job_name = SIMID
#@ wall_clock_limit = 120:00:00
#@ node = 8
#@ total_tasks = 32
#@ initialdir = /scratch/cdekate/
#@ executable = /usr/bin/poe
#@ arguments = /scratch/cdekate/padcirc
#@ queue

15

Page 319: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Running the Hello World example

• Using mpiexec : • Using PBS

16

mpd &
mpiexec -n 8 ./hello
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!

hello.pbs :
#!/bin/bash
#PBS -N hello
#PBS -l walltime=00:01:00,nodes=2:ppn=4
cd /home/cdekate/2008/l7
pwd
date
PROCS=`wc -l < $PBS_NODEFILE`
mpdboot -f $PBS_NODEFILE
mpiexec -n $PROCS ./hello
mpdallexit
date

more hello.o10030
/home/cdekate/2008/l7
Wed Feb 6 10:58:36 CST 2008
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Wed Feb 6 10:58:37 CST 2008

Page 320: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

17

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 321: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI Communicators

• Communicator is an internal object

• MPI Programs are made up of communicating processes

• Each process has its own address space containing its own attributes such as rank, size (and argc, argv, etc.)

• MPI provides functions to interact with it

• Default communicator is MPI_COMM_WORLD

– All processes are its members

– It has a size (the number of processes)

– Each process has a rank within it

– One can think of it as an ordered list of processes

• Additional communicator(s) can co-exist

• A process can belong to more than one communicator

• Within a communicator, each process has a unique rank

[Figure: the MPI_COMM_WORLD communicator containing eight processes with ranks 0 through 7]

18

Page 322: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

19

MPI: Size of Communicator

Function: MPI_Comm_size()

int MPI_Comm_size ( MPI_Comm comm, int *size )

Description:Determines the size of the group associated with a communicator (comm). Returns an integer number of processes in the group underlying comm executing the program. If comm is an inter-communicator (i.e. an object that has processes of two inter-communicating groups) , return the size of the local group (a size of a group where request is initiated from). The comm in the argument list refers to the communicator-group to be queried, the result of the query (size of the comm group) is stored in the variable size.

...

#include "mpi.h"

...

int size;
MPI_Init(&argc, &argv);

...

MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

...

err = MPI_Finalize();

...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Comm_size.html

Page 323: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

20

MPI: Rank of a process in comm

Function: MPI_Comm_rank()

int MPI_Comm_rank ( MPI_Comm comm, int *rank )

Description: Returns the rank of the calling process in the group underlying comm. If comm is an inter-communicator, MPI_Comm_rank returns the rank of the process in the local group. The first parameter, comm, is the communicator to be queried; the second parameter, rank, receives the rank of the calling process in the group of comm.

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Comm_rank.html

...

#include "mpi.h"

...

int size, rank;
MPI_Init(&argc, &argv);

...

MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

...

err = MPI_Finalize();

...

Page 324: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Example : communicators

21

#include "mpi.h"#include <stdio.h>

int main( int argc, char *argv[]){

int rank, size;MPI_Init( &argc, &argv);MPI_Comm_rank( MPI_COMM_WORLD, &rank);MPI_Comm_size( MPI_COMM_WORLD, &size);printf("Hello, World! from %d of %d\n", rank, size );MPI_Finalize();return 0;

}

Determines the rank of the current process in the communicator-group

MPI_COMM_WORLD

Determines the size of the communicator-group MPI_COMM_WORLD

…
Hello, World! from 1 of 8
Hello, World! from 0 of 8
Hello, World! from 5 of 8
…

Page 325: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Example : Communicator & Rank

• Compiling :

• Result :

22

mpicc -o hello2 hello2.c

Hello, World! from 4 of 8
Hello, World! from 3 of 8
Hello, World! from 1 of 8
Hello, World! from 0 of 8
Hello, World! from 5 of 8
Hello, World! from 6 of 8
Hello, World! from 7 of 8
Hello, World! from 2 of 8

Page 326: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

23

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 327: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI : Point to Point Communication

primitives

• The basic MPI communication mechanism between a pair of processes, in which one process sends data and the other receives it, is called “point to point communication”

• Message passing in an MPI program is carried out by 2 main MPI functions:
– MPI_Send – sends a message to a designated process

– MPI_Recv – receives a message from a process

• Each send and recv call carries additional information (the message envelope) along with the data to be exchanged between the application processes

• The message envelope consists of the following information– The rank of the receiver

– The rank of the sender

– A tag

– A communicator

• The source argument is used to distinguish messages received from different processes

• The tag is a user-specified int that can be used to distinguish messages from a single process (a minimal matched send/receive pair is sketched below)
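A minimal sketch (not part of the original slides), assuming at least two processes in MPI_COMM_WORLD: rank 0 sends one int to rank 1, and the envelopes (destination/source, tag, communicator) must agree for the message to be delivered.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value = 42, tag = 7;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* envelope: dest = 1, tag = 7, communicator = MPI_COMM_WORLD */
        MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* envelope must match: source = 0, tag = 7, same communicator */
        MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}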

24

Page 328: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Message Envelope

• Communication across processes is performed using messages.

• Each message carries a fixed set of fields used to distinguish it, called the message envelope:

– Envelope comprises source, destination, tag, communicator

– Message = Envelope + Data

• Communicator refers to the namespace associated with the group of related processes

25

[Figure: processes 0–7 in MPI_COMM_WORLD, with an example envelope — Source: process 0, Destination: process 1, Tag: 1234, Communicator: MPI_COMM_WORLD]

Page 329: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

26

MPI: (blocking) Send message

Function: MPI_Send()

int MPI_Send(

void *message,

int count,

MPI_Datatype datatype,

int dest,

int tag,

MPI_Comm comm )

Description: The contents of the message are stored in a block of memory referenced by the first parameter, message. The next two parameters, count and datatype, allow the system to determine how much storage is needed for the message: the message contains a sequence of count values, each of MPI type datatype. MPI allows a message to be received as long as sufficient storage has been allocated; if there isn't sufficient storage, an overflow error occurs. The dest parameter is the rank of the process to which the message is to be sent.

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Send.html

Page 330: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI : Data Types

MPI datatype C datatype

MPI_CHAR signed char

MPI_SHORT signed short int

MPI_INT signed int

MPI_LONG signed long int

MPI_UNSIGNED_CHAR unsigned char

MPI_UNSIGNED_SHORT unsigned short int

MPI_UNSIGNED unsigned int

MPI_UNSIGNED_LONG unsigned long int

MPI_FLOAT float

MPI_DOUBLE double

MPI_LONG_DOUBLE long double

MPI_BYTE

MPI_PACKED

27

You can also define your own (derived datatypes), such as an array of ints of size 100, or more complex examples, such as a struct or an array of structs

Page 331: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI: (blocking) Receive message

28

Function: MPI_Recv()

int MPI_Recv(

void *message,

int count,

MPI_Datatype datatype,

int source,

int tag,

MPI_Comm comm,

MPI_Status *status )

Description: The contents of the message are stored in a block of memory referenced by the first parameter, message. The next two parameters, count and datatype, allow the system to determine how much storage is needed for the message: the message contains a sequence of count values, each of MPI type datatype. MPI allows a message to be received as long as sufficient storage has been allocated; if there isn't sufficient storage, an overflow error occurs. The source parameter is the rank of the process from which the message is received. The MPI_Status parameter in the MPI_Recv() call returns information on the data that was actually received; it references a record with fields for the source and the tag.

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Recv.html

Page 332: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI_Status object

29

Object: MPI_Status

Example usage :MPI_Status status;

Description:The MPI_Status object is used by the receive functions to return data about the message, specifically the object contains the id of the process sending the message (MPI_SOURCE), the message tag (MPI_TAG), and error status (MPI_ERROR) .

#include "mpi.h"…

MPI_Status status; /* return status for */…MPI_Init(&argc, &argv);…if (my_rank != 0) {…

MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);}else { /* my rank == 0 */

for (source = 1; source < p; source++ ) {

MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);…MPI_Finalize();
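A small self-contained sketch (not part of the original slides) showing how the status fields can be inspected after a wildcard receive; MPI_ANY_SOURCE and MPI_ANY_TAG are standard MPI constants, and MPI_Get_count reports how many elements actually arrived.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, p, buf, count, source;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (rank != 0) {
        MPI_Send(&rank, 1, MPI_INT, 0, rank /* tag */, MPI_COMM_WORLD);
    } else {
        for (source = 1; source < p; source++) {
            /* accept from any sender with any tag, then inspect the envelope */
            MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_INT, &count);
            printf("got %d int(s) from rank %d, tag %d\n",
                   count, status.MPI_SOURCE, status.MPI_TAG);
        }
    }

    MPI_Finalize();
    return 0;
}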

Page 333: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI : Example send/recv

30

/* hello world, MPI style */

#include "mpi.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[]) {
    int my_rank;          /* rank of process            */
    int p;                /* number of processes        */
    int source;           /* rank of sender             */
    int dest;             /* rank of receiver           */
    int tag = 0;          /* tag for messages           */
    char message[100];    /* storage for message        */
    MPI_Status status;    /* return status for receive  */

    /* Start up MPI */
    MPI_Init(&argc, &argv);

    /* Find out process rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Find out number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (my_rank != 0) {
        /* Create message */
        sprintf(message, "Greetings from process %d!", my_rank);
        dest = 0;
        /* Use strlen+1 so that \0 gets transmitted */
        MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag,
                 MPI_COMM_WORLD);
    } else {   /* my_rank == 0 */
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag,
                     MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
        printf("Greetings from process %d!\n", my_rank);
    }

    /* Shut down MPI */
    MPI_Finalize();
}   /* end main */

Src : Prof. Amy Apon

Page 334: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Communication map for the example.

31

mpiexec -n 8 ./hello3
Greetings from process 1!
Greetings from process 2!
Greetings from process 3!
Greetings from process 4!
Greetings from process 5!
Greetings from process 6!
Greetings from process 7!
Greetings from process 0!
Writing logfile....
Finished writing logfile.
[cdekate@celeritas l7]$

Page 335: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

32

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 336: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point-to-point Communication

• How two processes interact

• Most flexible communication in MPI

• Two basic varieties

– Blocking and non-blocking

• Two basic functions

– Send and receive

• With these two functions, and the four functions we already know, you can do everything in MPI

– But there's probably a better way to do a lot of things, using other functions

33

Page 337: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point to Point Communication :

Basic concepts (buffered)

[Figure: buffered point-to-point path between Process 0 and Process 1, shown across user and kernel mode — Process 0 calls the send subroutine, data is copied from sendbuf to the system buffer (sysbuf) and sent over the network to the sysbuf at the receiving end; Process 1 calls the receive subroutine, which copies the data from its sysbuf into recvbuf (Steps 1–3)]

1. Data to be sent by the user is copied from the user memory space to the system buffer

2. The data is sent from the system buffer over the network to the system buffer of receiving process

3. The receiving process copies the data from system buffer to local user memory space

34

Page 338: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI communication modes

• MPI offers several different types of communication modes, each having implications on data handling and performance:– Buffered

– Ready

– Standard

– Synchronous

• Each of these communication modes has both blocking and non-blocking primitives
– In blocking point to point communication the send call blocks until the send buffer can be reclaimed. Similarly, the receive call blocks until the receive buffer has successfully obtained the contents of the message.

– In the non-blocking point to point communication the send and receive calls allow the possible overlap of communication with computation. Communication is usually done in 2 phases: the posting phase and the test for completion phase.

• Synchronization Overhead: the time spent waiting for an event to occur on another task.

• System Overhead: the time spent copying the message data from the sender's message buffer to the network and from the network to the receiver's message buffer.

35

Page 339: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point to Point Communication

Blocking Synchronous Send

• The communication mode is selected while invoking the send routine.

• When a blocking synchronous send (MPI_Ssend()) is executed, a “ready to send” message is sent from the sending task to the receiving task.

• When the receive call (MPI_Recv()) is executed, a “ready to receive” message is sent back, followed by the transfer of data (a minimal sketch follows at the end of this slide).

• The sender process must wait for the receive to be executed and for the handshake to arrive before the message can be transferred. (Synchronization Overhead)

• The receiver process also has to wait for the handshake process to complete. (Synchronization Overhead)

• Overhead incurred while copying from sender & receiver buffers to the network.

36

http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html
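A minimal sketch (not part of the original slides) of selecting the synchronous mode; MPI_Ssend takes the same argument list as MPI_Send, and here rank 0 synchronously sends one int to rank 1 (at least two processes assumed).

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value = 1, tag = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* MPI_Ssend does not complete until the matching receive has started */
        MPI_Ssend(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
        printf("rank 1 got %d\n", value);
    }

    MPI_Finalize();
    return 0;
}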

Page 340: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point to Point Communication

Blocking Ready Send

• The ready mode send call (MPI_Rsend) sends the message over the network

once the “ready to receive” message is received.

• If the “ready to receive” message hasn't arrived, the ready mode send will raise an error and exit. The programmer is responsible for handling such errors and overriding the default behavior.

• The ready mode send call minimizes the system and synchronization overhead incurred on the sending side.

• The receive can still incur substantial synchronization overhead, depending on how much earlier the receive call is executed.

37

http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html

Page 341: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point to Point Communication

Blocking Buffered Send

• The blocking buffered send call (MPI_Bsend()) copies the data from the message buffer to a user-supplied buffer and then returns.

• The message buffer can then be reclaimed by the sending process without affecting any data that is sent.

• When the “ready to receive” notification is received, the data from the user-supplied buffer is sent to the receiver.

• The replicated copy of the buffer results in added system overhead.

• Synchronization overhead on the sender process is eliminated, as the sending process does not have to wait on the receive call.

• Synchronization overhead on the receiving process can still be incurred: if the receive is executed before the send, the receiving process must wait before it can return to its execution sequence (see the sketch below).

38

http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html
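A minimal sketch (not part of the original slides) of buffered mode; MPI_Bsend requires a user-supplied buffer attached with MPI_Buffer_attach, and MPI_BSEND_OVERHEAD accounts for the bookkeeping space MPI needs per message.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, value = 3, tag = 0, bufsize;
    char *buffer;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* attach a user-supplied buffer large enough for one int message */
        bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        buffer = (char *) malloc(bufsize);
        MPI_Buffer_attach(buffer, bufsize);

        /* returns as soon as the data has been copied into the attached buffer */
        MPI_Bsend(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);

        /* detach blocks until buffered messages have been transmitted */
        MPI_Buffer_detach(&buffer, &bufsize);
        free(buffer);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
        printf("rank 1 got %d\n", value);
    }

    MPI_Finalize();
    return 0;
}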

Page 342: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point to Point Communication

Blocking Standard Send

• The MPI_Send() operation is implementation dependent

• When the data size is smaller than a threshold value (varies for each implementation):

– The blocking standard send call (MPI_Send()) copies the message over the network

into the system buffer of the receiving node, after which the sending process continues

with the computation

– When the receive call (MPI_Recv()) is executed the message is copied from the

system buffer to the receiving task

– The decreased synchronization overhead is usually at the cost of increased system

overhead due to the extra copy of buffers

39

http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html

Page 343: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point to Point Communication

Buffered Standard Send

• When the message size is greater than the threshold:
– The behavior is the same as for the synchronous mode

– Small messages benefit from the decreased chance of synchronization overhead

– Large messages result in the increased cost of copying to the buffer and added system overhead

40

http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html

Page 344: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point to Point Communication

Non-blocking Calls

• The non-blocking send call (MPI_Isend()) posts a non-blocking standard send when the message buffer contents are ready to be transmitted

• Control returns immediately, without waiting for the copy to the remote system buffer to complete. MPI_Wait is called just before the sending task needs to overwrite the message buffer

• The programmer is responsible for checking the status of the message to know whether the data to be sent has been copied out of the send buffer

• The receiving call (MPI_Irecv()) posts a non-blocking receive as soon as a message buffer is ready to hold the message. The non-blocking receive returns without waiting for the message to arrive. The receiving task calls MPI_Wait when it needs to use the incoming message data (see the sketch below)

41

http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html
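A minimal sketch (not part of the original slides) of the post/compute/wait pattern described above, assuming ranks 0 and 1 exchange one int each while other work could overlap the transfers.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, partner, sendval, recvval = -1;
    MPI_Request sreq, rreq;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank < 2) {                    /* only ranks 0 and 1 take part in the exchange */
        partner = 1 - rank;
        sendval = rank;

        /* posting phase: both transfers are started, neither call blocks */
        MPI_Isend(&sendval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &sreq);
        MPI_Irecv(&recvval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &rreq);

        /* ... computation that does not touch sendval or recvval could go here ... */

        /* test-for-completion phase: wait before reusing the buffers */
        MPI_Wait(&sreq, &status);
        MPI_Wait(&rreq, &status);

        printf("rank %d received %d\n", rank, recvval);
    }

    MPI_Finalize();
    return 0;
}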

Page 345: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point to Point Communication

Non-blocking Calls

• When the system buffer is full, a blocking send would have to wait until the receiving task pulled some message data out of the buffer. Use of non-blocking calls allows computation to be done during this interval, allowing for interleaving of computation and communication

• Properly posted non-blocking calls avoid this kind of deadlock

42

Page 346: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

43

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 347: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Deadlock

• Something to avoid

• A situation where the dependencies between processors are cyclic

– One processor is waiting for a message from another processor, but that processor is waiting for a message from the first, so nothing happens

• Until your time in the queue runs out and your job is killed

• MPI does not have timeouts

44

Page 348: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Deadlock Example

• If the message sizes are small enough, this should work because of system buffers

• If the messages are too large, or system buffering is not used, this will hang

if (rank == 0) {
    err = MPI_Send(sendbuf, count, datatype, 1, tag, comm);
    err = MPI_Recv(recvbuf, count, datatype, 1, tag, comm, &status);
} else {
    err = MPI_Send(sendbuf, count, datatype, 0, tag, comm);
    err = MPI_Recv(recvbuf, count, datatype, 0, tag, comm, &status);
}

45

Page 349: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Deadlock Example Solutions

if (rank == 0) {
    err = MPI_Send(sendbuf, count, datatype, 1, tag, comm);
    err = MPI_Recv(recvbuf, count, datatype, 1, tag, comm, &status);
} else {
    err = MPI_Recv(recvbuf, count, datatype, 0, tag, comm, &status);
    err = MPI_Send(sendbuf, count, datatype, 0, tag, comm);
}

or

if (rank == 0) {
    err = MPI_Isend(sendbuf, count, datatype, 1, tag, comm, &send_req);
    err = MPI_Irecv(recvbuf, count, datatype, 1, tag, comm, &recv_req);
    err = MPI_Wait(&send_req, &status);
    err = MPI_Wait(&recv_req, &status);
} else {
    err = MPI_Isend(sendbuf, count, datatype, 0, tag, comm, &send_req);
    err = MPI_Irecv(recvbuf, count, datatype, 0, tag, comm, &recv_req);
    err = MPI_Wait(&send_req, &status);
    err = MPI_Wait(&recv_req, &status);
}
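Another common way to express this exchange, not covered in these slides, is the combined MPI_Sendrecv call, which lets the MPI library schedule the send and the receive together so the call cannot deadlock against its matching partner; a minimal sketch using the same variable names as above:

/* each of the two ranks exchanges with its partner in a single call */
int partner = (rank == 0) ? 1 : 0;

err = MPI_Sendrecv(sendbuf, count, datatype, partner, tag,
                   recvbuf, count, datatype, partner, tag,
                   comm, &status);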

46

Page 350: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

47

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 351: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Numerical Integration Using Trapezoidal

Rule: A Case Study

• In review, the 6 main MPI calls:

– MPI_Init

– MPI_Finalize

– MPI_Comm_size

– MPI_Comm_rank

– MPI_Send

– MPI_Recv

• Using these 6 MPI function calls we can begin to

construct several kinds of parallel applications

• In the following section we discuss how to use these 6

calls to parallelize Trapezoidal Rule

48

Page 352: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Approximating Integrals: Definite Integral

• Problem: to find an approximate value of a definite integral

• A definite integral from a to b of a non-negative function f(x) can be thought of as the area bounded by the X-axis, the vertical lines x = a and x = b, and the graph of f(x)

49

Page 353: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Approximating Integrals : Trapezoidal Rule

• Approximating the area under the curve can be done by dividing the region under the curve into regular geometric shapes and then adding the areas of the shapes.

• In the Trapezoidal Rule, the region between a and b is divided into n trapezoids of base h = (b - a)/n

• The area of a trapezoid can be calculated as shown below

• In the case of our function, the area of the first block can be represented as shown below

• The area under the curve bounded by a and b can then be approximated as the sum shown below

50

Area of a trapezoid with parallel sides b1, b2 and base h:

  h (b1 + b2) / 2

Area of the first block:

  h ( f(a) + f(a+h) ) / 2

Approximation of the area under the curve between a and b (shown for n = 4 trapezoids):

  h ( f(a) + f(a+h) ) / 2  +  h ( f(a+h) + f(a+2h) ) / 2  +  h ( f(a+2h) + f(a+3h) ) / 2  +  h ( f(a+3h) + f(b) ) / 2

Page 354: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Approximating Integrals: Trapezoid Rule

• We can further generalize this concept of approximation

of integrals as a summation of trapezoidal areas

51

(h/2) [ f(x_0) + f(x_1) ]  +  (h/2) [ f(x_1) + f(x_2) ]  +  ...  +  (h/2) [ f(x_{n-1}) + f(x_n) ]

  =  (h/2) [ f(x_0) + f(x_1) + f(x_1) + f(x_2) + ... + f(x_{n-1}) + f(x_n) ]

  =  (h/2) [ f(x_0) + 2 f(x_1) + 2 f(x_2) + ... + 2 f(x_{n-1}) + f(x_n) ]

  =  h [ f(x_0)/2 + f(x_1) + f(x_2) + ... + f(x_{n-1}) + f(x_n)/2 ]

Page 355: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Trapezoidal Rule – Serial / Sequential

program in C

52

/* serial.c -- serial trapezoidal rule
 *
 * Calculate definite integral using trapezoidal rule.
 * The function f(x) is hardwired.
 * Input: a, b, n.
 * Output: estimate of integral from a to b of f(x) using n trapezoids.
 *
 * See Chapter 4, pp. 53 & ff. in PPMPI.
 */

#include <stdio.h>

main() {
    float integral;    /* Store result in integral  */
    float a, b;        /* Left and right endpoints  */
    int   n;           /* Number of trapezoids      */
    float h;           /* Trapezoid base width      */
    float x;
    int   i;
    float f(float x);  /* Function we're integrating */

    printf("Enter a, b, and n\n");
    scanf("%f %f %d", &a, &b, &n);

    h = (b-a)/n;
    integral = (f(a) + f(b))/2.0;
    x = a;
    for (i = 1; i <= n-1; i++) {
        x = x + h;
        integral = integral + f(x);
    }
    integral = integral*h;

    printf("With n = %d trapezoids, our estimate\n", n);
    printf("of the integral from %f to %f = %f\n", a, b, integral);
}  /* main */

float f(float x) {
    float return_val;
    /* Calculate f(x). Store calculation in return_val. */
    return_val = x*x;
    return return_val;
}  /* f */

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4

Page 356: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Results for the Serial Trapezoidal Rule

a b n f(x) single precision f(x) double precision

2 25 1 7233.500000 7233.500000

2 25 2 5712.625000 5712.625000

2 25 10 5225.945312 5225.945000

2 25 30 5207.916992 5207.919815

2 25 40 5206.934082 5206.934062

2 25 50 5206.475098 5206.477800

2 25 1000 5205.664551 5205.668694

53

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4


Page 357: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Parallelizing Trapezoidal Rule

• One way of parallelizing the Trapezoidal Rule:
– Distribute chunks of the workload to each process (each chunk characterized by its own subinterval of [a, b])

– Calculate the local integral for each subinterval

– Finally, add the local integrals from all the subintervals to produce the result for the complete interval [a, b]

• Issues to consider:
– The number of trapezoids (n) must be equally divisible across the (p) processes (load balancing)

– The first process calculates the area for the first n/p trapezoids, the second process calculates the area for the next n/p trapezoids, and so on

• Key information related to the problem that each process needs:
– The rank of the process

– The ability to derive its workload as a function of rank

Assumption : Process 0 does the summation

54


Page 358: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Parallelizing Trapezoidal Rule

• Algorithm (assumption: the number of trapezoids n is evenly divisible by the number of processors p)

– Calculate the trapezoid base width: h = (b - a) / n

– Each process calculates its own workload (interval to integrate)

• local number of trapezoids ( local_n) = n/p

• local starting point (local_a) = a+(process_rank *local_n* h)

• local ending point (local_b) = (local_a + local_n * h)

– Each process calculates its own integral for the local intervals

• For each of the local_n trapezoids calculate area

• Aggregate area for local_n trapezoids

– If PROCESS_RANK == 0

• Receive messages (containing sub-interval area aggregates) from all processors

• Aggregate (ADD) all sub-interval areas

– If PROCESS_RANK > 0

• Send sub-interval area to PROCESS_RANK(0)

Classic SPMD: all processes run the same program on different datasets.

55


Page 359: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Parallel Trapezoidal Rule

56

#include <stdio.h>
#include "mpi.h"

main(int argc, char** argv) {
    int   my_rank;        /* My process rank                         */
    int   p;              /* The number of processes                 */
    float a = 0.0;        /* Left endpoint                           */
    float b = 1.0;        /* Right endpoint                          */
    int   n = 1024;       /* Number of trapezoids                    */
    float h;              /* Trapezoid base length                   */
    float local_a;        /* Left endpoint my process                */
    float local_b;        /* Right endpoint my process               */
    int   local_n;        /* Number of trapezoids for my calculation */
    float integral;       /* Integral over my interval               */
    float total;          /* Total integral                          */
    int   source;         /* Process sending integral                */
    int   dest = 0;       /* All messages go to 0                    */
    int   tag = 0;
    MPI_Status status;

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4

Page 360: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Parallel Trapezoidal Rule

57

    float Trap(float local_a, float local_b, int local_n, float h);  /* Calculate local integral */

    /* Let the system do what it needs to start up MPI */
    MPI_Init(&argc, &argv);

    /* Get my process rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Find out how many processes are being used */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    h = (b-a)/n;      /* h is the same for all processes */
    local_n = n/p;    /* So is the number of trapezoids  */

    /* Length of each process' interval of
     * integration = local_n*h. So my interval
     * starts at: */
    local_a = a + my_rank*local_n*h;
    local_b = local_a + local_n*h;
    integral = Trap(local_a, local_b, local_n, h);

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4

Page 361: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Parallel Trapezoidal Rule

58

    /* Add up the integrals calculated by each process */
    if (my_rank == 0) {
        total = integral;
        for (source = 1; source < p; source++) {
            MPI_Recv(&integral, 1, MPI_FLOAT, source, tag,
                     MPI_COMM_WORLD, &status);
            total = total + integral;
        }
    } else {
        MPI_Send(&integral, 1, MPI_FLOAT, dest,
                 tag, MPI_COMM_WORLD);
    }

    /* Print the result */
    if (my_rank == 0) {
        printf("With n = %d trapezoids, our estimate\n", n);
        printf("of the integral from %f to %f = %f\n", a, b, total);
    }

    /* Shut down MPI */
    MPI_Finalize();
}  /* main */

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4

Page 362: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Parallel Trapezoidal Rule

59

float Trap(float local_a  /* in */,
           float local_b  /* in */,
           int   local_n  /* in */,
           float h        /* in */) {

    float integral;      /* Store result in integral */
    float x;
    int   i;
    float f(float x);    /* function we're integrating */

    integral = (f(local_a) + f(local_b))/2.0;
    x = local_a;
    for (i = 1; i <= local_n-1; i++) {
        x = x + h;
        integral = integral + f(x);
    }
    integral = integral*h;
    return integral;
}  /* Trap */

float f(float x) {
    float return_val;
    /* Calculate f(x). Store calculation in return_val. */
    return_val = x*x;
    return return_val;
}  /* f */

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4

Page 363: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Parallel Trapezoidal Rule

60

[cdekate@celeritas l7]$ mpiexec -n 8 … trap
With n = 1024 trapezoids, our estimate
of the integral from 2.000000 to 25.000000 = 5205.667969
Writing logfile....
Finished writing logfile.

[cdekate@celeritas l7]$ ./serial
Enter a, b, and n
2 25 1024
With n = 1024 trapezoids, our estimate
of the integral from 2.000000 to 25.000000 = 5205.666016
[cdekate@celeritas l7]$

Page 364: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

61

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 365: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Profiling Applications

• To profile your parallel applications (a consolidated command sketch follows these steps):
1. Compile the application with

mpicc -profile=mpe_mpilog -o trap trap.c

2. Run the application using the standard PBS/mpirun procedure

3. After your run is complete you should see lines like these in your stdout (the standard-output file of your PBS-based run):
Writing logfile....
Finished writing logfile.

4. You will also see a file with the extension “clog2”

5. I.e., if your executable was named “parallel_program”, you would see a file named “parallel_program.clog2”

6. Convert the “clog2” file to “slog2” format by issuing the command
clog2TOslog2 parallel_program.clog2
(maintain the capitalization in the clog2TOslog2 command)

7. Step 6 will result in a parallel_program.slog2 file

8. Use Jumpshot to visualize this file
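Putting the steps together as a single command sketch (the executable name parallel_program is illustrative, and on a production system the run in steps 2–3 would normally be launched from a PBS script like the ones shown earlier):

# 1. compile with the MPE logging wrapper
mpicc -profile=mpe_mpilog -o parallel_program parallel_program.c

# 2-3. run as usual (interactively here, or via PBS/mpirun); the run writes parallel_program.clog2
mpiexec -n 8 ./parallel_program

# 6. convert the clog2 file to slog2 format
clog2TOslog2 parallel_program.clog2

# 8. visualize the resulting parallel_program.slog2 with Jumpshot
java -jar jumpshot.jar parallel_program.slog2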

62

Page 366: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Using Jumpshot

Note : You need Java Runtime Environment on your

machine in order to be able to run Jumpshot

Download your parallel_program.slog2 file from Arete

• Download Jumpshot from :

– ftp://ftp.mcs.anl.gov/pub/mpi/slog2/slog2rte.tar.gz

– Uncompress the tar.gz file to get a folder: slog2rte-1.2.6/

– In slog2rte-1.2.6/lib/, type: java -jar jumpshot.jar parallel_program.slog2

• Or click on jumpshot_launcher.jar

• Open the file using Jumpshot's File menu

63

Page 367: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

64

Topics

• Introduction

• MPI Standard

• MPI-1.x Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 368: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Summary : Material for the Test

• Basic MPI – 10, 11, 12

• Communicators – 18, 19, 20

• Point to Point Communication – 24, 25, 26, 27, 28

• In-depth Point to Point Communication – 33, 34, 35, 36,

37, 38, 39, 40, 41, 42

• Deadlock – 44, 45, 46

65

Page 369: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

66

Page 370: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &

MEANS

MESSAGE PASSING INTERFACE MPI

(PART B)

Prof. Thomas SterlingDepartment of Computer ScienceLouisiana State UniversityFebruary 10, 2011

Page 371: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 20112

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 372: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Review of Basic MPI Calls

• In review, the 6 main MPI calls:

– MPI_Init

– MPI_Finalize

– MPI_Comm_size

– MPI_Comm_rank

– MPI_Send

– MPI_Recv

• Include MPI Header file

– #include “mpi.h”

• Basic MPI Datatypes

– MPI_INT, MPI_FLOAT, ….

3

Page 373: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 20114

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 374: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Collective Calls

• A communication pattern that encompasses all processes within a communicator is known as collective communication

• MPI has several collective communication calls; the most frequently used are:
– Synchronization

• Barrier

– Communication

• Broadcast

• Gather & Scatter

• All Gather

– Reduction

• Reduce

• AllReduce

5

Page 375: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 20116

MPI Collective Calls : Barrier

Function: MPI_Barrier()

int MPI_Barrier (

MPI_Comm comm )

Description: Creates barrier synchronization in a communicator group comm. Each process, when reaching the MPI_Barrier call, blocks until all the processes in the group reach the same MPI_Barrier call.

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Barrier.html

[Figure: processes P0–P3 each blocking at MPI_Barrier() until all four have reached it]

Page 376: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example: MPI_Barrier()

7

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);

    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    MPI_Barrier(MPI_COMM_WORLD);

    printf("Hello world! Process %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

[cdekate@celeritas collective]$ mpirun -np 8 barrier
Hello world! Process 0 of 8 on celeritas.cct.lsu.edu
Writing logfile....
Finished writing logfile.
Hello world! Process 4 of 8 on compute-0-3.local
Hello world! Process 1 of 8 on compute-0-0.local
Hello world! Process 3 of 8 on compute-0-2.local
Hello world! Process 6 of 8 on compute-0-5.local
Hello world! Process 7 of 8 on compute-0-6.local
Hello world! Process 5 of 8 on compute-0-4.local
Hello world! Process 2 of 8 on compute-0-1.local
[cdekate@celeritas collective]$

Page 377: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 20118

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 378: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 20119

MPI Collective Calls : Broadcast

Function: MPI_Bcast()

int MPI_Bcast (

void *message,

int count,

MPI_Datatype datatype,

int root,

MPI_Comm comm )

Description: A collective communication call where a single process sends the same data contained in the message to every process in the communicator. By default a tree-like algorithm is used to broadcast the message to a block of processors; a linear algorithm is then used to broadcast the message from the first process in a block to all other processes. All the processes invoke the MPI_Bcast call with the same arguments for root and comm.

float endpoint[2];
...
MPI_Bcast(endpoint, 2, MPI_FLOAT, 0, MPI_COMM_WORLD);
...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Bcast.html

[Figure: Broadcast — data item A on P0 is replicated so that P0, P1, P2, and P3 all hold A]

Page 379: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201110

MPI Collective Calls : Scatter

Function: MPI_Scatter()

int MPI_Scatter (

void *sendbuf,

int send_count,

MPI_Datatype send_type,

void *recvbuf,

int recv_count,

MPI_Datatype recv_type,

int root,

MPI_Comm comm)

Description: MPI_Scatter splits the data referenced by sendbuf on the process with rank root into p segments, each of which consists of send_count elements of type send_type. The first segment is sent to process 0, the second segment to process 1, and so on. The send arguments are significant only on the process with rank root.

...
MPI_Scatter(&(local_A[0][0]), n/p, MPI_FLOAT, row_segment, n/p, MPI_FLOAT, 0, MPI_COMM_WORLD);
...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Scatter.html

[Figure: Scatter — local_A[][] = (A, B, C, D) on P0 is split so that P0 receives A, P1 receives B, P2 receives C, and P3 receives D in row_segment]
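A minimal sketch (not part of the original slides) that scatters one int to each process; the array and variable names are illustrative.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, p, my_value, i;
    int *senddata = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (rank == 0) {
        /* only the root needs the full send buffer */
        senddata = (int *) malloc(p * sizeof(int));
        for (i = 0; i < p; i++)
            senddata[i] = 100 + i;
    }

    /* each process (including the root) receives one int */
    MPI_Scatter(senddata, 1, MPI_INT, &my_value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d received %d\n", rank, my_value);

    if (rank == 0)
        free(senddata);
    MPI_Finalize();
    return 0;
}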

Page 380: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201111

MPI Collective Calls : Gather

Function: MPI_Gather()

int MPI_Gather (

void *sendbuf,

int send_count,

MPI_Datatype sendtype,

void *recvbuf,

int recvcount,

MPI_Datatype recvtype,

int root,

MPI_Comm comm )

Description: MPI_Gather collects the data referenced by sendbuf from each process in the communicator comm, and stores the data in process-rank order on the process with rank root, in the location referenced by recvbuf. The recv parameters are significant only on the process with rank root.

...
MPI_Gather(local_x, n/p, MPI_FLOAT, global_x, n/p, MPI_FLOAT, 0, MPI_COMM_WORLD);
...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Gather.html

[Figure: Gather — local_x values A, B, C, D held by P0–P3 are collected into global_x = (A, B, C, D) on P0]

Page 381: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201112

MPI Collective Calls : All Gather

Function: MPI_Allgather()

int MPI_Allgather (

void *sendbuf,

int send_count,

MPI_Datatype sendtype,

void *recvbuf,

int recvcount,

MPI_Datatype recvtype,

MPI_Comm comm )

Description: MPI_Allgather gathers the content of the send buffer (sendbuf) from each process. The effect of this call is similar to executing MPI_Gather() p times, with a different process acting as the root each time:

for (root = 0; root < p; root++)
    MPI_Gather(local_x, n/p, MPI_FLOAT, global_x, n/p, MPI_FLOAT, root, MPI_COMM_WORLD);
...

CAN BE REPLACED WITH:

MPI_Allgather(local_x, local_n, MPI_FLOAT, global_x, local_n, MPI_FLOAT, MPI_COMM_WORLD);

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Allgather.html

[Figure: All Gather — A, B, C, D contributed by P0–P3; every process ends up with the full sequence (A, B, C, D)]
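A minimal sketch (not part of the original slides): every rank contributes one int and every rank receives the full array.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, p, i, my_value;
    int *all_values;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    my_value = rank * rank;                       /* this rank's contribution      */
    all_values = (int *) malloc(p * sizeof(int)); /* every rank gets the full result */

    MPI_Allgather(&my_value, 1, MPI_INT, all_values, 1, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0)
        for (i = 0; i < p; i++)
            printf("all_values[%d] = %d\n", i, all_values[i]);

    free(all_values);
    MPI_Finalize();
    return 0;
}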

Page 382: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201113

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 383: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201114

MPI Collective Calls : ReduceFunction: MPI_Reduce()

int MPI_Reduce (

void *operand,

void *result,

int count,

MPI_Datatype datatype,

MPI_Op operator,

int root,

MPI_Comm comm )

Description: A collective communication call where all the processes in a communicator contribute data that is combined using a binary operation (MPI_Op) such as addition, max, min, logical and, etc. MPI_Reduce combines the operands stored in the memory referenced by operand using the operation operator and stores the result in *result on the process with rank root. MPI_Reduce is called by all the processes in the communicator comm, and count, datatype, operator and root must be the same on each of them.

...
MPI_Reduce(&local_integral, &integral, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Reduce.html

[Figure: Reduce with binary op MPI_SUM — A, B, C, D held by P0–P3 are combined into A+B+C+D on P0]
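A minimal sketch (not part of the original slides): each rank contributes its rank value and rank 0 receives the sum.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, p, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* combine every rank's value with MPI_SUM; the result lands on rank 0 */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", p - 1, sum);   /* equals p*(p-1)/2 */

    MPI_Finalize();
    return 0;
}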

Page 384: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

MPI Binary Operations

15

• MPI binary operators are used in the MPI_Reduce function call as one of the parameters. MPI_Reduce performs a global reduction operation (dictated by the MPI binary operator parameter) on the supplied operands.

• Some of the common MPI binary operators are:

Operation Name   Meaning
MPI_MAX          Maximum
MPI_MIN          Minimum
MPI_SUM          Sum
MPI_PROD         Product
MPI_LAND         Logical And
MPI_BAND         Bitwise And
MPI_LOR          Logical Or
MPI_BOR          Bitwise Or
MPI_LXOR         Logical XOR
MPI_BXOR         Bitwise XOR
MPI_MAXLOC       Maximum and location of max.
MPI_MINLOC       Minimum and location of min.

MPI_Reduce(&local_integral,

&integral, 1, MPI_FLOAT,

MPI_SUM, 0, MPI_COMM_WORLD);

Page 385: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201116

MPI Collective Calls : All Reduce

Function: MPI_Allreduce()

int MPI_Allreduce (

void *sendbuf,

void *recvbuf,

int count,

MPI_Datatype datatype,

MPI_Op op,

MPI_Comm comm )

Description: MPI_Allreduce is used exactly like MPI_Reduce, except that the result of the reduction is returned on all processes; as a result there is no root parameter.

...
MPI_Allreduce(&local_integral, &integral, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Allreduce.html

[Figure: All Reduce with binary op MPI_SUM — A, B, C, D held by P0–P3; every process receives A+B+C+D]

Page 386: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Parallel Trapezoidal Rule

Send, Recv

17

#include <stdio.h>
#include "mpi.h"

main(int argc, char** argv) {
    int   my_rank;        /* My process rank                         */
    int   p;              /* The number of processes                 */
    float a = 0.0;        /* Left endpoint                           */
    float b = 1.0;        /* Right endpoint                          */
    int   n = 1024;       /* Number of trapezoids                    */
    float h;              /* Trapezoid base length                   */
    float local_a;        /* Left endpoint my process                */
    float local_b;        /* Right endpoint my process               */
    int   local_n;        /* Number of trapezoids for my calculation */
    float integral;       /* Integral over my interval               */
    float total;          /* Total integral                          */
    int   source;         /* Process sending integral                */
    int   dest = 0;       /* All messages go to 0                    */
    int   tag = 0;
    MPI_Status status;

    float Trap(float local_a, float local_b, int local_n, float h);  /* Calculate local integral */

    /* Let the system do what it needs to start up MPI */
    MPI_Init(&argc, &argv);

    /* Get my process rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Find out how many processes are being used */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    h = (b-a)/n;      /* h is the same for all processes */
    local_n = n/p;    /* So is the number of trapezoids  */

    /* Length of each process' interval of
     * integration = local_n*h. So my interval
     * starts at: */
    local_a = a + my_rank*local_n*h;
    local_b = local_a + local_n*h;
    integral = Trap(local_a, local_b, local_n, h);

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4

Page 387: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Parallel Trapezoidal Rule

Send, Recv

18

    if (my_rank == 0) {
        total = integral;
        for (source = 1; source < p; source++) {
            MPI_Recv(&integral, 1, MPI_FLOAT, source, tag,
                     MPI_COMM_WORLD, &status);
            total = total + integral;
        }
    } else {
        MPI_Send(&integral, 1, MPI_FLOAT, dest,
                 tag, MPI_COMM_WORLD);
    }

    if (my_rank == 0) {
        printf("With n = %d trapezoids, our estimate\n", n);
        printf("of the integral from %f to %f = %f\n", a, b, total);
    }

    MPI_Finalize();
}  /* main */

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4

Page 388: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Parallel Trapezoidal Rule

Send, Recv

19

float Trap(float local_a  /* in */,
           float local_b  /* in */,
           int   local_n  /* in */,
           float h        /* in */) {

    float integral;      /* Store result in integral */
    float x;
    int   i;
    float f(float x);    /* function we're integrating */

    integral = (f(local_a) + f(local_b))/2.0;
    x = local_a;
    for (i = 1; i <= local_n-1; i++) {
        x = x + h;
        integral = integral + f(x);
    }
    integral = integral*h;
    return integral;
}  /* Trap */

float f(float x) {
    float return_val;
    /* Calculate f(x). Store calculation in return_val. */
    return_val = x*x;
    return return_val;
}  /* f */

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4

Page 389: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201120

Flowchart for Parallel Trapezoidal Rule

[Flowchart: the MASTER and each WORKER initialize the MPI environment, create local workload buffers (variables etc.), isolate their work regions, and calculate the sequential trapezoid rule for their local region; each worker calculates its local integral and sends the result to the “master”, while the master receives the results from the “workers”, integrates the results for the local workload, concatenates the results to file, and ends]

Page 390: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Trapezoidal Rule :

with MPI_Bcast, MPI_Reduce

21

#include <stdio.h>
#include <stdlib.h>

/* We'll be using MPI routines, definitions, etc. */
#include "mpi.h"

main(int argc, char** argv) {
    int   my_rank;        /* My process rank                         */
    int   p;              /* The number of processes                 */
    float endpoint[2];    /* Left and right endpoints                */
    int   n = 1024;       /* Number of trapezoids                    */
    float h;              /* Trapezoid base length                   */
    float local_a;        /* Left endpoint my process                */
    float local_b;        /* Right endpoint my process               */
    int   local_n;        /* Number of trapezoids for my calculation */
    float integral;       /* Integral over my interval               */
    float total;          /* Total integral                          */
    int   source;         /* Process sending integral                */
    int   dest = 0;       /* All messages go to 0                    */
    int   tag = 0;
    MPI_Status status;

    float Trap(float local_a, float local_b, int local_n, float h);  /* Calculate local integral */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (argc != 3) {
        if (my_rank == 0)
            printf("Usage: mpirun -np <numprocs> trapezoid <left> <right>\n");
        MPI_Finalize();
        exit(0);
    }

    if (my_rank == 0) {
        endpoint[0] = atof(argv[1]);   /* left endpoint  */
        endpoint[1] = atof(argv[2]);   /* right endpoint */
    }

    MPI_Bcast(endpoint, 2, MPI_FLOAT, 0, MPI_COMM_WORLD);

Page 391: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Trapezoidal Rule :

with MPI_Bcast, MPI_Reduce

22

    h = (endpoint[1]-endpoint[0])/n;   /* h is the same for all processes */
    local_n = n/p;                     /* so is the number of trapezoids  */
    if (my_rank == 0)
        printf("a=%f, b=%f, Local number of trapezoids=%d\n",
               endpoint[0], endpoint[1], local_n);

    local_a = endpoint[0] + my_rank*local_n*h;
    local_b = local_a + local_n*h;
    integral = Trap(local_a, local_b, local_n, h);

    MPI_Reduce(&integral, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (my_rank == 0) {
        printf("With n = %d trapezoids, our estimate\n", n);
        printf("of the integral from %f to %f = %f\n",
               endpoint[0], endpoint[1], total);
    }

    MPI_Finalize();
}  /* main */

float Trap(float local_a  /* in */,
           float local_b  /* in */,
           int   local_n  /* in */,
           float h        /* in */) {

    float integral;      /* Store result in integral */
    float x;
    int   i;
    float f(float x);    /* function we're integrating */

    integral = (f(local_a) + f(local_b))/2.0;
    x = local_a;
    for (i = 1; i <= local_n-1; i++) {
        x = x + h;
        integral = integral + f(x);
    }
    integral = integral*h;
    return integral;
}  /* Trap */

float f(float x) {
    float return_val;
    /* Calculate f(x). Store calculation in return_val. */
    return_val = x*x;
    return return_val;
}  /* f */

Page 392: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Trapezoidal Rule :

with MPI_Bcast, MPI_Reduce

23

#!/bin/bash

#PBS -N name

#PBS -l walltime=120:00:00,nodes=2:ppn=4

cd /home/lsu00/Demos/l9/trapBcast

pwd

date

PROCS=`wc -l < $PBS_NODEFILE`

mpdboot --file=$PBS_NODEFILE

/usr/lib64/mpich2/bin/mpiexec -n $PROCS ./trapBcast 2 25 >>out.txt

mpdallexit

date

Page 393: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201124

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 394: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Constructing Datatypes

• Creating data structures in C:

typedef struct {
    . . .
} STRUCT_NAME;

• For example, in the numerical integration by trapezoidal rule we could create a data structure for storing the attributes of the problem as follows:

typedef struct {
    float a;
    float b;
    int   n;
} DATA_INTEGRAL;
. . .
DATA_INTEGRAL intg_data;

• What would happen if you used:

MPI_Bcast( &intg_data, 1, DATA_INTEGRAL, 0, MPI_COMM_WORLD);

25

ERROR!!! intg_data is of type DATA_INTEGRAL, which is NOT an MPI_Datatype (see the sketch below for one way to build a matching MPI datatype)
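A minimal sketch (not part of the original slides) of how a matching MPI datatype for DATA_INTEGRAL could be built with MPI_Type_struct, following the same pattern as the struct example later in this lecture; the displacement arithmetic assumes the two floats are laid out contiguously before the int, as the MPI_Type_extent-based offsets in that example do.

MPI_Datatype mpi_data_integral;
MPI_Datatype oldtypes[2]    = { MPI_FLOAT, MPI_INT };
int          blockcounts[2] = { 2, 1 };        /* two floats (a, b), then one int (n) */
MPI_Aint     offsets[2], extent;

offsets[0] = 0;
MPI_Type_extent(MPI_FLOAT, &extent);
offsets[1] = 2 * extent;                       /* n starts after the two floats */

MPI_Type_struct(2, blockcounts, offsets, oldtypes, &mpi_data_integral);
MPI_Type_commit(&mpi_data_integral);

/* now the whole struct can be broadcast in one call */
MPI_Bcast(&intg_data, 1, mpi_data_integral, 0, MPI_COMM_WORLD);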

Page 395: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Constructing MPI Datatypes

• MPI allows users to define derived MPI datatypes, built from the basic datatypes at execution time

• These derived datatypes can be used in MPI communication calls instead of the basic predefined datatypes.

• A sending process can pack noncontiguous data into a contiguous buffer and send the buffered data to a receiving process, which can unpack the contiguous buffer and store the data to noncontiguous locations.

• A derived datatype is an opaque object that specifies :– A sequence of primitive datatypes

– A sequence of integer (byte) displacements

• MPI has several functions for constructing derived datatypes :– Contiguous

– Vector

– Indexed

– Struct

26

Page 396: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

MPI : Basic Data Types

(Review)

MPI datatype C datatype

MPI_CHAR signed char

MPI_SHORT signed short int

MPI_INT signed int

MPI_LONG signed long int

MPI_UNSIGNED_CHAR unsigned char

MPI_UNSIGNED_SHORT unsigned short int

MPI_UNSIGNED unsigned int

MPI_UNSIGNED_LONG unsigned long int

MPI_FLOAT float

MPI_DOUBLE double

MPI_LONG_DOUBLE long double

MPI_BYTE

MPI_PACKED

27

You can also define your own (derived datatypes), such as an array of ints of size 100, or more complex examples, such as a struct or an array of structs

Page 397: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201128

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 398: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201129

Derived Datatypes : Contiguous

Function: MPI_Type_contiguous()

int MPI_Type_contiguous(

int count,

MPI_Datatype old_type,

MPI_Datatype *new_type)

Description: This is the simplest of the MPI derived-datatype constructors. The contiguous datatype constructor creates a new datatype by making count copies of an existing datatype (old_type).

MPI_Datatype rowtype;
...
MPI_Type_contiguous(SIZE, MPI_FLOAT, &rowtype);
MPI_Type_commit(&rowtype);
...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Type_contiguous.html

Page 399: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example : Derived Datatypes - Contiguous

30

#include "mpi.h"#include <stdio.h>

#define SIZE 4

int main(argc,argv)int argc;char *argv[]; {int numtasks, rank, source=0, dest, tag=1, i;MPI_Request req;

float a[SIZE][SIZE] ={1.0, 2.0, 3.0, 4.0,5.0, 6.0, 7.0, 8.0,9.0, 10.0, 11.0, 12.0,13.0, 14.0, 15.0, 16.0};

float b[SIZE];

MPI_Status stat;

MPI_Datatype rowtype;

MPI_Init(&argc,&argv);MPI_Comm_rank(MPI_COMM_WORLD, &rank);MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

MPI_Type_contiguous(SIZE, MPI_FLOAT, &rowtype);MPI_Type_commit(&rowtype);if (numtasks == SIZE) {if (rank == 0) {

for (i=0; i<numtasks; i++){dest = i;

MPI_Isend(&a[i][0], 1, rowtype, dest, tag, MPI_COMM_WORLD, &req);

}}

MPI_Recv(b, SIZE, MPI_FLOAT, source, tag, MPI_COMM_WORLD, &stat);printf(“rank= %d b= %3.1f %3.1f %3.1f %3.1f\n”,

rank,b[0],b[1],b[2],b[3]);}

elseprintf(“Must specify %d processors. Terminating.\n”,SIZE);

MPI_Type_free(&rowtype);MPI_Finalize();}

Declares a 4x4 array of datatype float; each row (1.0 2.0 3.0 4.0 / 5.0 6.0 7.0 8.0 / 9.0 10.0 11.0 12.0 / 13.0 14.0 15.0 16.0) is sent as one homogeneous data structure of size 4 (type: rowtype).

https://computing.llnl.gov/tutorials/mpi/

Page 400: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example : Derived Datatypes - Contiguous

31

https://computing.llnl.gov/tutorials/mpi/

Page 401: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201132

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 402: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201133

Derived Datatypes : Vector

Function: MPI_Type_vector()

int MPI_Type_vector(

int count,

int blocklen,

int stride,

MPI_Datatype old_type,

MPI_Datatype *newtype )

Description:

Returns a new datatype that represents equally spaced blocks. The spacing between the start of each block is given in units of extent (oldtype). The count represents the number of blocks, blocklen details the number of elements in each block, stride represents the number of elements between start of each block of the old_type. The new datatype is stored in new_type

...

MPI_Type_vector(SIZE, 1, SIZE, MPI_FLOAT, &columntype);...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Type_vector.html

Page 403: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example : Derived Datatypes - Vector

34

#include "mpi.h"#include <stdio.h>#define SIZE 4

int main(argc,argv)int argc;char *argv[]; {int numtasks, rank, source=0, dest, tag=1, i;MPI_Request req;float a[SIZE][SIZE] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0,

13.0, 14.0, 15.0, 16.0};float b[SIZE];

MPI_Status stat;MPI_Datatype columntype;

MPI_Init(&argc,&argv);MPI_Comm_rank(MPI_COMM_WORLD, &rank);MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

MPI_Type_vector(SIZE, 1, SIZE, MPI_FLOAT, &columntype);MPI_Type_commit(&columntype);

if (numtasks == SIZE) {if (rank == 0) {

for (i=0; i<numtasks; i++)

MPI_Isend(&a[0][i], 1, columntype, i, tag, MPI_COMM_WORLD, &req);

}

MPI_Recv(b, SIZE, MPI_FLOAT, source, tag, MPI_COMM_WORLD, &stat);printf("rank= %d b= %3.1f %3.1f %3.1f %3.1f\n",

rank,b[0],b[1],b[2],b[3]);}

elseprintf("Must specify %d processors. Terminating.\n",SIZE);

MPI_Type_free(&columntype);MPI_Finalize();}

https://computing.llnl.gov/tutorials/mpi/

Declares a 4x4 array of datatype float; each column (1.0 5.0 9.0 13.0 / 2.0 6.0 10.0 14.0 / 3.0 7.0 11.0 15.0 / 4.0 8.0 12.0 16.0) is sent as one homogeneous data structure of size 4 (type: columntype).

Page 404: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example : Derived Datatypes - Vector

35

https://computing.llnl.gov/tutorials/mpi/

Page 405: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201136

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 406: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201137

Derived Datatypes : Indexed

Function: MPI_Type_indexed()

int MPI_Type_indexed(

int count,

int *array_of_blocklengths,

int *array_of_displacements,

MPI_Datatype oldtype,

MPI_datatype *newtype);

Description: Returns a new datatype that represents count blocks. Each block is defined by an entry in array_of_blocklengths and array_of_displacements. Displacements are expressed in units of extent(oldtype). count is both the number of blocks and the number of entries in array_of_displacements (the displacement of each block in units of oldtype) and array_of_blocklengths (the number of instances of oldtype in each block).

...
MPI_Type_indexed(2, blocklengths, displacements, MPI_FLOAT, &indextype);
...

https://computing.llnl.gov/tutorials/mpi/man/MPI_Type_indexed.txt

Page 407: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example : Derived Datatypes - Indexed

38

#include "mpi.h"
#include <stdio.h>
#define NELEMENTS 6

int main(int argc, char *argv[])
{
   int numtasks, rank, source=0, dest, tag=1, i;
   MPI_Request req;
   int blocklengths[2], displacements[2];
   float a[16] = { 1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
                   9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0};
   float b[NELEMENTS];
   MPI_Status stat;
   MPI_Datatype indextype;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

   blocklengths[0] = 4;
   blocklengths[1] = 2;
   displacements[0] = 5;
   displacements[1] = 12;

   MPI_Type_indexed(2, blocklengths, displacements, MPI_FLOAT, &indextype);
   MPI_Type_commit(&indextype);

   if (rank == 0) {
      for (i=0; i<numtasks; i++)
         MPI_Isend(a, 1, indextype, i, tag, MPI_COMM_WORLD, &req);
   }

   MPI_Recv(b, NELEMENTS, MPI_FLOAT, source, tag, MPI_COMM_WORLD, &stat);
   printf("rank= %d b= %3.1f %3.1f %3.1f %3.1f %3.1f %3.1f\n",
          rank, b[0], b[1], b[2], b[3], b[4], b[5]);

   MPI_Type_free(&indextype);
   MPI_Finalize();
}

https://computing.llnl.gov/tutorials/mpi/

Declares a 16-element array of type float:
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0 15.0 16.0

Creates a new datatype indextype that picks out two blocks of the array: 4 elements starting at displacement 5 (6.0 7.0 8.0 9.0) and 2 elements starting at displacement 12 (13.0 14.0).

Page 408: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example : Derived Datatypes - Indexed

39

https://computing.llnl.gov/tutorials/mpi/

Page 409: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201140

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 410: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201141

Derived Datatypes : struct

Function: MPI_Type_struct()

int MPI_Type_struct(

int count,

int *array_of_blocklengths,

MPI_Aint *array_of_displacements,

MPI_Datatype *array_of_types,

MPI_Datatype *newtype);

Description:

Returns a new datatype that represents count blocks. Each block is defined by an entry in array_of_blocklengths, array_of_displacements and array_of_types; displacements are expressed in bytes. count is an integer that specifies the number of blocks (and the number of entries in each of the three arrays). array_of_blocklengths gives the number of elements in each block, array_of_displacements gives the byte displacement of each block, and array_of_types gives the type of the elements that make up each block.

...

MPI_Type_struct(2, blockcounts, offsets, oldtypes, &particletype);...

https://computing.llnl.gov/tutorials/mpi/man/MPI_Type_struct.txt

Page 411: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example : Derived Datatype - struct

42

#include "mpi.h"
#include <stdio.h>
#define NELEM 25

int main(int argc, char *argv[])
{
   int numtasks, rank, source=0, dest, tag=1, i;
   typedef struct {
      float x, y, z;
      float velocity;
      int n, type;
   } Particle;
   Particle p[NELEM], particles[NELEM];
   MPI_Datatype particletype, oldtypes[2];
   int blockcounts[2];
   /* MPI_Aint type used to be consistent with syntax of */
   /* MPI_Type_extent routine */
   MPI_Aint offsets[2], extent;
   MPI_Status stat;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

   /* Setup description of the 4 MPI_FLOAT fields x, y, z, velocity */
   offsets[0] = 0;  oldtypes[0] = MPI_FLOAT;  blockcounts[0] = 4;

   /* Setup description of the 2 MPI_INT fields n, type */
   MPI_Type_extent(MPI_FLOAT, &extent);
   offsets[1] = 4 * extent;  oldtypes[1] = MPI_INT;  blockcounts[1] = 2;

   MPI_Type_struct(2, blockcounts, offsets, oldtypes, &particletype);
   MPI_Type_commit(&particletype);

   if (rank == 0) {
      for (i=0; i<NELEM; i++) {
         particles[i].x = i * 1.0;   particles[i].y = i * -1.0;
         particles[i].z = i * 1.0;   particles[i].velocity = 0.25;
         particles[i].n = i;         particles[i].type = i % 2;
      }
      for (i=0; i<numtasks; i++)
         MPI_Send(particles, NELEM, particletype, i, tag, MPI_COMM_WORLD);
   }

   MPI_Recv(p, NELEM, particletype, source, tag, MPI_COMM_WORLD, &stat);
   printf("rank= %d %3.2f %3.2f %3.2f %3.2f %d %d\n", rank,
          p[3].x, p[3].y, p[3].z, p[3].velocity, p[3].n, p[3].type);

   MPI_Type_free(&particletype);
   MPI_Finalize();
}

https://computing.llnl.gov/tutorials/mpi/

Declaring the structure of the heterogeneous datatype Float, Float, Float, Float, Int, Int

Construct the heterogeneous datatype as an MPI datatype using Struct

Populate the heterogeneous MPI datatype with heterogeneous data

Page 412: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201143

https://computing.llnl.gov/tutorials/mpi/

Example : Derived Datatype - struct

Page 413: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201144

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 414: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201145

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.

Matrix Vector Multiplication

Page 415: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201146

Matrix Vector Multiplication

where A is an n x m matrix and B is a vector of size m and C is a vector of size n.

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.

Multiplication of a matrix A and a vector B produces the vector C, whose elements c_i (0 <= i < n) are computed as follows:

\[
c_i = \sum_{k=0}^{m-1} A_{i,k}\, b_k
\]
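For reference, a minimal serial C sketch of this formula (the 4x4 size and the initial values mirror the MPI case study that follows; the function names are illustrative only):

#include <stdio.h>

#define N 4   /* rows of A, length of c */
#define M 4   /* columns of A, length of b */

/* c[i] = sum over k of A[i][k] * b[k] */
void matvec(double A[N][M], double b[M], double c[N])
{
    for (int i = 0; i < N; i++) {
        c[i] = 0.0;
        for (int k = 0; k < M; k++)
            c[i] += A[i][k] * b[k];
    }
}

int main(void)
{
    double A[N][M] = {{0,1,2,3},{1,2,3,4},{2,3,4,5},{3,4,5,6}};
    double b[M] = {1, 2, 3, 4};
    double c[N];
    matvec(A, b, c);
    for (int i = 0; i < N; i++)
        printf("c[%d] = %g\n", i, c[i]);
    return 0;
}

The MPI example later in this lecture parallelizes this loop nest by distributing rows of A among worker tasks.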

Page 416: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201147

Matrix-Vector Multiplication: c = A x b

Page 417: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example: Matrix-Vector Multiplication

(DEMO)

48

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define NRA 4          /* number of rows in matrix A */
#define NCA 4          /* number of columns in matrix A */
#define NCB 1          /* number of columns in matrix B */
#define MASTER 0       /* taskid of first task */
#define FROM_MASTER 1  /* setting a message type */
#define FROM_WORKER 2  /* setting a message type */

int main (int argc, char *argv[])
{
   int numtasks,              /* number of tasks in partition */
       taskid,                /* a task identifier */
       numworkers,            /* number of worker tasks */
       source, dest,          /* task ids of message source and destination */
       mtype,                 /* message type */
       rows,                  /* rows of matrix A sent to each worker */
       averow, extra, offset, /* used to determine rows sent to each worker */
       i, j, k, rc;           /* misc */

Define the dimensions of the Matrix a([4][4]) and Vector b([4][1])

Page 418: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201149

   double a[NRA][NCA],   /* Matrix A to be multiplied */
          b[NCA][NCB],   /* Vector B to be multiplied */
          c[NRA][NCB];   /* result Vector C */
   MPI_Status status;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
   if (numtasks < 2) {
      printf("Need at least two MPI tasks. Quitting...\n");
      MPI_Abort(MPI_COMM_WORLD, rc);
      exit(1);
   }
   numworkers = numtasks-1;

/**************************** master task ************************************/
   if (taskid == MASTER)
   {
      printf("mpi_mm has started with %d tasks.\n", numtasks);
      printf("Initializing arrays...\n");
      for (i=0; i<NRA; i++)
         for (j=0; j<NCA; j++)
            a[i][j] = i+j;
      for (i=0; i<NCA; i++)
         for (j=0; j<NCB; j++)
            b[i][j] = (i+1)*(j+1);

Example: Matrix-Vector Multiplication

Declare the matrix, the vector to be multiplied, and the resultant vector

MASTER initializes the Matrix A:
0.00 1.00 2.00 3.00
1.00 2.00 3.00 4.00
2.00 3.00 4.00 5.00
3.00 4.00 5.00 6.00

MASTER initializes B: 1.00 2.00 3.00 4.00

Page 419: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201150

      for (i=0; i<NRA; i++) {
         printf("\n");
         for (j=0; j<NCA; j++)
            printf("%6.2f ", a[i][j]);
      }
      for (i=0; i<NRA; i++) {
         printf("\n");
         for (j=0; j<NCB; j++)
            printf("%6.2f ", b[i][j]);
      }

      /* Send matrix data to the worker tasks */
      averow = NRA/numworkers;
      extra = NRA%numworkers;
      offset = 0;
      mtype = FROM_MASTER;
      for (dest=1; dest<=numworkers; dest++)
      {
         rows = (dest <= extra) ? averow+1 : averow;
         printf("Sending %d rows to task %d offset=%d\n", rows, dest, offset);
         MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
         MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
         MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
         MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
         offset = offset + rows;
      }

Example: Matrix-Vector Multiplication

Load Balancing : Dividing the Matrix A based on the number of processors

MASTER sends Matrix A to workers:
PROC[0] :: 0.00 1.00 2.00 3.00
PROC[1] :: 1.00 2.00 3.00 4.00
PROC[2] :: 2.00 3.00 4.00 5.00
PROC[3] :: 3.00 4.00 5.00 6.00

MASTER sends Vector B to workers:
PROC[0] :: 1.00 2.00 3.00 4.00
PROC[1] :: 1.00 2.00 3.00 4.00
PROC[2] :: 1.00 2.00 3.00 4.00
PROC[3] :: 1.00 2.00 3.00 4.00

Page 420: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example: Matrix-Vector Multiplication

51

      /* Receive results from worker tasks */
      mtype = FROM_WORKER;
      for (i=1; i<=numworkers; i++)
      {
         source = i;
         MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
         MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
         MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype,
                  MPI_COMM_WORLD, &status);
         printf("Received results from task %d\n", source);
      }

      /* Print results */
      printf("******************************************************\n");
      printf("Result Matrix:\n");
      for (i=0; i<NRA; i++)
      {
         printf("\n");
         for (j=0; j<NCB; j++)
            printf("%6.2f ", c[i][j]);
      }
      printf("\n******************************************************\n");
      printf("Done.\n");
   }

The Master process gathers the results and populates the result matrix in the correct order (easily done in this case because the offset received from each worker indicates the position in the result array)

Page 421: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example: Matrix-Vector Multiplication

52

/**************************** worker task ************************************/
   if (taskid > MASTER)
   {
      mtype = FROM_MASTER;
      MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

      for (k=0; k<NCB; k++)
         for (i=0; i<rows; i++)
         {
            c[i][k] = 0.0;
            for (j=0; j<NCA; j++)
               c[i][k] = c[i][k] + a[i][j] * b[j][k];
         }

      mtype = FROM_WORKER;
      MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
      MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
      MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
   }
   MPI_Finalize();
}

Worker processes receive workload:
Proc[1] A : 1.00 2.00 3.00 4.00
Proc[1] B : 1.00 2.00 3.00 4.00

Calculate result:
Proc[1] C : 1.00 + 4.00 + 9.00 + 16.00 = 30.00

Page 422: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example: Matrix-Vector Multiplication

(Results)

53

Page 423: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201154

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 424: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201155

MPI Profiling : MPI_Wtime

Function: MPI_Wtime()

double MPI_Wtime()

Description:

Returns the time in seconds elapsed on the calling processor. The resolution of the time scale is determined by MPI_WTICK. When the MPI environment variable MPI_WTIME_IS_GLOBAL is defined and set to true, the value of MPI_Wtime is synchronized across all processes in MPI_COMM_WORLD.

double time0;
...
time0 = MPI_Wtime();
...
printf("Hello From Worker #%d %lf \n", rank, (MPI_Wtime() - time0));

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Wtime.html

Page 425: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Timing Example: MPI_Wtime

56

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
   int size, rank;
   double time0, time1;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);

   time0 = MPI_Wtime();
   if (rank == 0)
   {
      printf(" Hello From Proc0 Time = %lf \n", (MPI_Wtime() - time0));
   }
   else
   {
      printf("Hello From Worker #%d %lf \n", rank, (MPI_Wtime() - time0));
   }
   MPI_Finalize();
}

Page 426: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201157

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 427: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Additional Topics

• Additional topics not yet covered :

– Communication Topologies

– Profiling using Tau (to be covered with PAPI & Parallel

Algorithms)

– Profiling using PMPI (to be covered with PAPI & Parallel

Algorithms)

– Debugging MPI programs

58

Page 428: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201159

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 429: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Summary : Material for the Test

• Collective calls

– Barrier (6) , Broadcast (9), Scatter(10), Gather(11), Allgather(12)

– Reduce(14), Binary operations (15), All Reduce (16)

• Derived Datatypes (25,26,27)

– Contiguous (29,30,31)

– Vector (33,34,35)

60

Page 430: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201161

Page 431: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS

SMP NODES

Prof.  Thomas  Sterling  Department  of  Computer  Science  Louisiana  State  University  February  15,  2011  

Page 432: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

2  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 433: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

3  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 434: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

4  

Opening Remarks

•  This week is about supercomputer architecture –  Last time: end of cooperative computing –  Today: capability computing with modern microprocessor and

multicore SMP node

•  As we’ve seen, there is a diversity of HPC system types •  Most common systems are either SMPs or are

ensembles of SMP nodes •  “SMP” stands for: “Symmetric Multi-Processor” •  System performance is strongly influenced by SMP node

performance •  Understanding structure, functionality, and operation of

SMP nodes will allow effective programming

Page 435: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

5  

The take-away message

•  Primary structure and elements that make up an SMP node

•  Primary structure and elements that make up the modern multicore microprocessor component

•  The factors that determine microprocessor delivered performance

•  The factors that determine overall SMP sustained performance

•  Amdahl’s law and how to use it •  Calculating cpi •  Reference: J. Hennessy & D. Patterson, “Computer Architecture

A Quantitative Approach” 3rd Edition, Morgan Kaufmann, 2003

Page 436: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

6  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 437: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

7  

SMP Context

•  A standalone system –  Incorporates everything needed for

•  Processors •  Memory •  External I/O channels •  Local disk storage •  User interface

–  Enterprise server and institutional computing market •  Exploits economy of scale to enhance performance to cost •  Substantial performance

–  Target for ISVs (Independent Software Vendors) •  Shared memory multiple thread programming platform

–  Easier to program than distributed memory machines –  Enough parallelism to fully employ system threads (processor cores)

•  Building block for ensemble supercomputers –  Commodity clusters –  MPPs

Page 438: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

8  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 439: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

9  

Performance: Amdahl’s Law

Baton Rouge to Houston
•  from my house on East Lakeshore Dr.
•  to the downtown Hyatt Regency
•  distance of 271 miles
•  in-air flight time: 1 hour
•  door-to-door time to drive: 4.5 hours
•  cruise speed of Boeing 737: 600 mph
•  cruise speed of BMW 528: 60 mph

Page 440: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

10  

Amdahl’s Law: drive or fly? •  Peak performance gain: 10X

–  BMW cruise approx. 60 MPH –  Boeing 737 cruise approx. 600 MPH

•  Time door to door –  BMW

•  Google estimates 4 hours 30 minutes –  Boeing 737

•  Time to drive to BTR from my house = 15 minutes •  Wait time at BTR = 1 hour •  Taxi time at BTR = 5 minutes •  Continental estimates BTR to IAH 1 hour •  Taxi time at IAH = 15 minutes (assuming gate available) •  Time to get bags at IAH = 25 minutes •  Time to get rental car = 15 minutes •  Time to drive to Hyatt Regency from IAH = 45 minutes •  Total time = 4.0 hours

•  Sustained performance gain: 1.125X
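The two gains quoted above follow directly from the numbers on this slide:

\[
S_{peak} = \frac{600\ \text{mph}}{60\ \text{mph}} = 10\times,
\qquad
S_{sustained} = \frac{T_{drive}}{T_{fly}} = \frac{4.5\ \text{h}}{4.0\ \text{h}} = 1.125\times
\]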

Page 441: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

11  

Amdahl’s Law

Definitions:
T_O = time for the non-accelerated computation
T_A = time for the accelerated computation
T_F = time of the portion of the computation that can be accelerated
g   = peak performance gain for the accelerated portion
f   = fraction of the non-accelerated computation to be accelerated
S   = speedup of the computation with acceleration applied

\[
T_A = (1 - f)\,T_O + f\,\frac{T_O}{g}
\]
\[
S = \frac{T_O}{T_A} = \frac{1}{(1 - f) + \dfrac{f}{g}}
\]

(Timeline sketch: the original run of length T_O contains an accelerable portion T_F; in the accelerated run of length T_A that portion shrinks to T_F/g.)
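As a quick illustration with made-up values (not from the slides), take f = 0.9 and g = 10:

\[
S = \frac{1}{(1 - 0.9) + \dfrac{0.9}{10}} = \frac{1}{0.19} \approx 5.3
\]

Even a 10x accelerator applied to 90% of the work yields only about a 5x overall speedup.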

Page 442: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

12  

Amdahl’s Law and Parallel Computers

•  Amdahl's Law (FracX: fraction of the original work to be sped up)
Speedup = 1 / [(FracX/SpeedupX) + (1-FracX)]

•  A portion is sequential => limits parallel speedup
–  Speedup <= 1 / (1-FracX)

•  Ex. What fraction sequential to get 80X speedup from 100 processors? Assume either 1 processor or 100 fully used

80 = 1 / [(FracX/100) + (1-FracX)]
0.8*FracX + 80*(1-FracX) = 80 - 79.2*FracX = 1
FracX = (80-1)/79.2 = 0.9975
•  Only 0.25% sequential!
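A small C sketch (illustrative only) that evaluates the same formula and checks the result above:

#include <stdio.h>

/* Amdahl speedup: frac = parallelizable fraction, n = number of processors */
double amdahl(double frac, double n)
{
    return 1.0 / ((frac / n) + (1.0 - frac));
}

int main(void)
{
    printf("Speedup = %.1f\n", amdahl(0.9975, 100.0));  /* prints roughly 80 */
    return 0;
}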

Page 443: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

13  

Amdahl’s Law with Overhead

Definitions (other symbols as on the previous slide):
v = overhead of each accelerated work segment
V = total overhead for the accelerated work, V = n v (n = number of accelerated segments)
T_F = sum of the accelerable segment times t_F

\[
T_A = (1 - f)\,T_O + f\,\frac{T_O}{g} + n\,v
\]
\[
S = \frac{T_O}{T_A} = \frac{1}{(1 - f) + \dfrac{f}{g} + \dfrac{n\,v}{T_O}}
\]

(Timeline sketch: the original run T_O contains n accelerable segments of length t_F; in the accelerated run T_A each becomes v + t_F/g.)
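With the same made-up values as before (f = 0.9, g = 10) and a total overhead n v equal to 5% of T_O:

\[
S = \frac{1}{(1 - 0.9) + \dfrac{0.9}{10} + 0.05} = \frac{1}{0.24} \approx 4.2
\]

compared with roughly 5.3 when the overhead is ignored.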

Page 444: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

14  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 445: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

15  

SMP Node Diagram

(Diagram: an SMP node. Several microprocessors (MP), each with private L1 and L2 caches, share L3 caches and connect through a controller to memory banks M1, M2, ..., Mn; the node also provides storage (S), NICs, USB peripherals, JTAG, Ethernet, and PCI-e interfaces.)

Legend:
MP : MicroProcessor
L1, L2, L3 : Caches
M1, M2, ... : Memory Banks
S : Storage
NIC : Network Interface Card

Page 446: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

16  

SMP System Examples

Vendor & name | Processor | Number of cores | Cores per proc. | Memory | Chipset | PCI slots
IBM eServer p5 595 | IBM Power5 1.9 GHz | 64 | 2 | 2 TB | Proprietary GX+, RIO-2 | <=240 PCI-X (20 standard)
Microway QuadPuter-8 | AMD Opteron 2.6 GHz | 16 | 2 | 128 GB | Nvidia nForce Pro 2200+2050 | 6 PCIe
Ion M40 | Intel Itanium 2 1.6 GHz | 8 | 2 | 128 GB | Hitachi CF-3e | 4 PCIe, 2 PCI-X
Intel Server System SR870BN4 | Intel Itanium 2 1.6 GHz | 8 | 2 | 64 GB | Intel E8870 | 8 PCI-X
HP Proliant ML570 G3 | Intel Xeon 7040 3 GHz | 8 | 2 | 64 GB | Intel 8500 | 4 PCIe, 6 PCI-X
Dell PowerEdge 2950 | Intel Xeon 5300 2.66 GHz | 8 | 4 | 32 GB | Intel 5000X | 3 PCIe

Page 447: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

17  

Sample SMP Systems

DELL  PowerEdge  

HP  Proliant  

Intel    Server  System  

IBM  p5  595  

Microway  Quadputer  

Page 448: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

18  

HyperTransport-based SMP System

Source: http://www.devx.com/amd/Article/17437

Page 449: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

19  

Comparison of Opteron and Xeon SMP Systems

Source: http://www.devx.com/amd/Article/17437

Page 450: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

20  

Multi-Chip Module (MCM) Component of IBM Power5 Node

20  

Page 451: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

21  

Major Elements of an SMP Node •  Processor chip •  DRAM main memory cards •  Motherboard chip set •  On-board memory network

–  North bridge •  On-board I/O network

–  South bridge •  PCI industry standard interfaces

–  PCI, PCI-X, PCI-express •  System Area Network controllers

–  e.g. Ethernet, Myrinet, Infiniband, Quadrics, Federation Switch •  System Management network

–  Usually Ethernet –  JTAG for low level maintenance

•  Internal disk and disk controller •  Peripheral interfaces

Page 452: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

22  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 453: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

23  

(Die photo labels: FPU, IA-32 Control, Instr. Fetch & Decode, Cache, TLB, Integer Units, IA-64 Control, Bus, Core Processor Die, 4 x 1MB L3 cache)

Itanium™ Processor Silicon (Copyright: Intel at Hotchips ’00)

Page 454: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

24  

Multicore Microprocessor Component Elements

•  Multiple processor cores –  One or more processors

•  L1 caches –  Instruction cache –  Data cache

•  L2 cache –  Joint instruction/data cache –  Dedicated to individual core processor

•  L3 cache –  Not all systems –  Shared among multiple cores –  Often off die but in same package

•  Memory interface –  Address translation and management (sometimes) –  North bridge

•  I/O interface –  South bridge

Page 455: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

25  

Comparison of Current Microprocessors

Processor | Clock rate | Caches (per core) | ILP (each core) | Cores per chip | Process & die size | Power | Linpack TPP (one core)
AMD Opteron | 2.6 GHz | L1I: 64KB, L1D: 64KB, L2: 1MB | 2 FPops/cycle, 3 Iops/cycle, 2* LS/cycle | 2 | 90nm, 220mm2 | 95W | 3.89 Gflops
IBM Power5+ | 2.2 GHz | L1I: 64KB, L1D: 32KB, L2: 1.875MB, L3: 18MB | 4 FPops/cycle, 2 Iops/cycle, 2 LS/cycle | 2 | 90nm, 243mm2 | 180W (est.) | 8.33 Gflops
Intel Itanium 2 (9000 series) | 1.6 GHz | L1I: 16KB, L1D: 16KB, L2I: 1MB, L2D: 256KB, L3: 3MB or more | 4 FPops/cycle, 4 Iops/cycle, 2 LS/cycle | 2 | 90nm, 596mm2 | 104W | 5.95 Gflops
Intel Xeon Woodcrest | 3 GHz | L1I: 32KB, L1D: 32KB, L2: 2MB | 4 FPops/cycle, 3 Iops/cycle, 1L+1S/cycle | 2 | 65nm, 144mm2 | 80W | 6.54 Gflops

Page 456: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

26  

Processor Core Micro Architecture

•  Execution Pipeline –  Stages of functionality to process issued instructions –  Hazards are conflicts with continued execution –  Forwarding supports closely associated operations exhibiting

precedence constraints •  Out of Order Execution

–  Uses reservation stations –  Hides some core latencies and provides fine-grain asynchronous
operation supporting concurrency •  Branch Prediction

–  Permits computation to proceed at a conditional branch point prior to resolving predicate value

–  Overlaps follow-on computation with predicate resolution –  Requires roll-back or equivalent to correct false guesses –  Sometimes follows both paths, and several deep

Page 457: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

27  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 458: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

28  

Recap: Who Cares About the Memory Hierarchy?

(Chart: Processor-DRAM memory gap (latency), 1980-2000. CPU performance improves about 60%/yr (2X/1.5 yr, "Moore's Law") while DRAM improves about 9%/yr (2X/10 yrs); the processor-memory performance gap grows roughly 50% per year. Axes: relative performance (1 to 1000, log scale) vs. time.)

Copyright 2001, UCB, David Patterson

Page 459: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011   29  

What is a cache? •  Small, fast storage used to improve average access time to slow

memory. •  Exploits spatial and temporal locality •  In computer architecture, almost everything is a cache!

–  Registers: a cache on variables –  First-level cache: a cache on second-level cache –  Second-level cache: a cache on memory –  Memory: a cache on disk (virtual memory) –  TLB :a cache on page table –  Branch-prediction: a cache on prediction information

(Hierarchy diagram, faster toward the top and bigger toward the bottom: Proc/Regs -> L1-Cache -> L2-Cache -> Memory -> Disk, Tape, etc.)

Copyright 2001, UCB, David Patterson

Page 460: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

30  

Levels of the Memory Hierarchy

Level | Capacity | Access time | Cost | Staging Xfer unit | Managed by
CPU Registers | 100s Bytes | < 0.5 ns (typically 1 CPU cycle) | - | Instr. operands, 1-8 bytes | prog./compiler
Cache (L1) | 10s-100s KBytes | 1-5 ns | $10/MByte | Blocks, 8-128 bytes | cache cntl
Main Memory | Few GBytes | 50-150 ns | $0.02/MByte | Pages, 512-4K bytes | OS
Disk | 100s-1000s GBytes | 500,000-1,500,000 ns | $0.25/GByte | Files, MBytes | user/operator
Tape | infinite | sec-min | $0.0014/MByte | - | -

(Upper levels are faster; lower levels are larger.)

Copyright 2001, UCB, David Patterson

Page 461: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

31  

Cache Measures

•  Hit rate: fraction found in that level –  So high that usually talk about Miss rate

•  Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)

•  Miss penalty: time to replace a block from lower level, including time to replace in CPU

–  access time: time to lower level = f(latency to lower level)

–  transfer time: time to transfer block =f(BW between upper & lower levels)

Copyright 2001, UCB, David Patterson
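A quick illustration of the average access time formula with hypothetical numbers (1 ns hit time, 5% miss rate, 100 ns miss penalty; not from the slides):

\[
\text{Average memory-access time} = 1\ \text{ns} + 0.05 \times 100\ \text{ns} = 6\ \text{ns}
\]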

Page 462: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

32  

Memory Hierarchy: Terminology •  Hit: data appears in some block in the upper level (example:

Block X) –  Hit Rate: the fraction of memory accesses found in the upper level –  Hit Time: Time to access the upper level which consists of

RAM access time + Time to determine hit/miss •  Miss: data needs to be retrieved from a block in the lower level

(Block Y) –  Miss Rate = 1 - (Hit Rate) –  Miss Penalty: Time to replace a block in the upper level +

Time to deliver the block to the processor •  Hit Time << Miss Penalty (500 instructions on 21264!)

(Diagram: a block Blk X held in upper-level memory is delivered to the processor on a hit; a block Blk Y must be fetched from lower-level memory into the upper level on a miss.)

Copyright 2001, UCB, David Patterson

Page 463: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

Cache Performance

33  

\[
T = I_{count} \times CPI \times T_{cycle}
\]
\[
CPI = \left(\frac{I_{ALU}}{I_{count}}\right) CPI_{ALU} + \left(\frac{I_{MEM}}{I_{count}}\right) CPI_{MEM}
\]

T = total execution time
T_cycle = time for a single processor cycle
I_count = total number of instructions
I_ALU = number of ALU instructions (e.g. register-register)
I_MEM = number of memory access instructions (e.g. load, store)
CPI = average cycles per instruction
CPI_ALU = average cycles per ALU instruction
CPI_MEM = average cycles per memory instruction
r_miss = cache miss rate
r_hit = cache hit rate
CPI_MEM-MISS = cycles per cache miss
CPI_MEM-HIT = cycles per cache hit
M_ALU = instruction mix for ALU instructions
M_MEM = instruction mix for memory access instructions

Page 464: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

Cache Performance

34  

Instruction mix:
\[
M_{ALU} = \frac{I_{ALU}}{I_{count}}, \qquad M_{MEM} = \frac{I_{MEM}}{I_{count}}, \qquad M_{ALU} + M_{MEM} = 1
\]
\[
CPI = M_{ALU}\,CPI_{ALU} + M_{MEM}\,CPI_{MEM}
\]
\[
T = I_{count} \times \left( M_{ALU}\,CPI_{ALU} + M_{MEM}\,CPI_{MEM} \right) \times T_{cycle}
\]

(Variable definitions as on the previous slide.)

Page 465: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

Cache Performance

35  

(Variable definitions as on slide 33.)

\[
CPI_{MEM} = CPI_{MEM\text{-}HIT} + r_{miss}\,CPI_{MEM\text{-}MISS}
\]
\[
T = I_{count} \times \left[ M_{ALU}\,CPI_{ALU} + M_{MEM}\left( CPI_{MEM\text{-}HIT} + r_{miss}\,CPI_{MEM\text{-}MISS} \right) \right] \times T_{cycle}
\]

Page 466: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

Cache Performance: Example

36  

Given: $I_{count} = 10^{11}$, $I_{MEM} = 2 \times 10^{10}$, $CPI_{ALU} = 1$, $T_{cycle} = 0.5$ ns, $CPI_{MEM\text{-}HIT} = 1$, $CPI_{MEM\text{-}MISS} = 100$.

Instruction mix: $I_{ALU} = I_{count} - I_{MEM} = 8 \times 10^{10}$, so $M_{ALU} = 0.8$ and $M_{MEM} = 0.2$.

Case A, $r_{hit\text{-}A} = 0.9$:
\[
CPI_{MEM\text{-}A} = 1 + (1 - 0.9) \times 100 = 11
\]
\[
T_A = 10^{11} \times \left( (0.8 \times 1) + (0.2 \times 11) \right) \times 0.5 \times 10^{-9}\ \text{s} = 150\ \text{sec}
\]

Case B, $r_{hit\text{-}B} = 0.5$:
\[
CPI_{MEM\text{-}B} = 1 + (1 - 0.5) \times 100 = 51
\]
\[
T_B = 10^{11} \times \left( (0.8 \times 1) + (0.2 \times 51) \right) \times 0.5 \times 10^{-9}\ \text{s} = 550\ \text{sec}
\]
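A minimal C sketch that evaluates the execution-time model above with the values of this example (the printed times should match the 150 s and 550 s results):

#include <stdio.h>

int main(void)
{
    double Icount  = 1e11;      /* total instructions */
    double Imem    = 2e10;      /* memory access instructions */
    double Tcycle  = 0.5e-9;    /* seconds per cycle */
    double cpi_alu = 1.0, cpi_hit = 1.0, cpi_miss = 100.0;
    double m_mem   = Imem / Icount;   /* 0.2 */
    double m_alu   = 1.0 - m_mem;     /* 0.8 */
    double rhit[2] = { 0.9, 0.5 };

    for (int i = 0; i < 2; i++) {
        double cpi_mem = cpi_hit + (1.0 - rhit[i]) * cpi_miss;
        double T = Icount * (m_alu * cpi_alu + m_mem * cpi_mem) * Tcycle;
        printf("hit rate %.1f: CPI_MEM = %.0f, T = %.0f s\n", rhit[i], cpi_mem, T);
    }
    return 0;
}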

Page 467: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

37  

Page 468: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

38  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 469: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

39  

Motherboard Chipset

•  Provides core functionality of motherboard •  Embeds low-level protocols to facilitate efficient communication between

local components of computer system •  Controls the flow of data between the CPU, system memory, on-board

peripheral devices, expansion interfaces and I/O subsystem •  Also responsible for power management features, retention of non-volatile

configuration data and real-time measurement •  Typically consists of:

–  Northbridge (Memory Controller Hub, MCH), managing traffic between the processor, RAM, GPU, southbridge and optionally PCI Express slots

–  Southbridge (I/O Controller Hub, ICH), coordinating slower set of devices, including traditional PCI bus, ISA bus, SMBus, IDE (ATA), DMA and interrupt controllers, real-time clock, BIOS memory, ACPI power management, LPC bridge (providing fan control, floppy disk, keyboard, mouse, MIDI interfaces, etc.), and optionally Ethernet, USB, IEEE1394, audio codecs and RAID interface

Page 470: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

40  

Major Chipset Vendors

•  Intel –  http://developer.intel.com/products/chipsets/index.htm

•  Via –  http://www.via.com.tw/en/products/chipsets

•  SiS –  http://www.sis.com/products/product_000001.htm

•  AMD/ATI –  http://ati.amd.com/products/integrated.html

•  Nvidia –  http://www.nvidia.com/page/mobo.html

Page 471: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

41  

Chipset Features Overview

Page 472: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

42  

Motherboard

•  Also referred to as main board, system board, backplane •  Provides mechanical and electrical support for pluggable

components of a computer system •  Constitutes the central circuitry of a computer,

distributing power and clock signals to target devices, and implementing communication backplane for data exchanges between them

•  Defines expansion possibilities of a computer system through slots accommodating special purpose cards, memory modules, processor(s) and I/O ports

•  Available in many form factors and with various capabilities to match particular system needs, housing capacity and cost

Page 473: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

43  

Motherboard Form Factors

•  Refer to standardized motherboard sizes •  Most popular form factor used today is ATX, evolved

from now obsolete AT (Advanced Technology) format •  Examples of other common form factors:

–  MicroATX, miniaturized version of ATX –  WTX, large form factor designated for use in high power

workstations/servers featuring multiple processors –  Mini-ITX, designed for use in thin clients –  PC/104 and ETX, used in embedded systems and single

board computers –  BTX (Balanced Technology Extended), introduced by Intel as

a possible successor to ATX

Page 474: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

44  

Motherboard Manufacturers

•  Abit  •  Albatron  •  Aopen  •  ASUS  •  Biostar  •  DFI  •  ECS  •  Epox  •  FIC  •  Foxconn  •  Gigabyte  

•  IBM  •  Intel  •  Jetway  •  MSI  •  ShuTle  •  Soyo  •  SuperMicro  •  Tyan  •  VIA  

Page 475: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

45  

Source: http://www.motherboards.org

Populated CPU Socket

Page 476: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

46  

Source: http://www.motherboards.org

DIMM Memory Sockets

Page 477: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

47  

Motherboard on Arete

Page 478: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

48  

Source: http://www.tyan.com

SuperMike Motherboard: Tyan Thunder i7500 (S720)

Page 479: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

49  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 480: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

50  

PCI enhanced systems

http://arstechnica.com/articles/paedia/hardware/pcie.ars/1

Page 481: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

51  

PCI-express

Lane width | Clock speed | Throughput (duplex, bits) | Throughput (duplex, bytes) | Initial expected uses
x1 | 2.5 GHz | 5 Gbps | 400 MBps | Slots, Gigabit Ethernet
x2 | 2.5 GHz | 10 Gbps | 800 MBps |
x4 | 2.5 GHz | 20 Gbps | 1.6 GBps | Slots, 10 Gigabit Ethernet, SCSI, SAS
x8 | 2.5 GHz | 40 Gbps | 3.2 GBps |
x16 | 2.5 GHz | 80 Gbps | 6.4 GBps | Graphics adapters

http://www.redbooks.ibm.com/abstracts/tips0456.html

Page 482: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

52  

PCI-X | Bus width | Clock speed | Features | Bandwidth
PCI-X 66 | 64 bits | 66 MHz | Hot Plugging, 3.3 V | 533 MB/s
PCI-X 133 | 64 bits | 133 MHz | Hot Plugging, 3.3 V | 1.06 GB/s
PCI-X 266 | 64 bits, optional 16 bits only | 133 MHz Double Data Rate | Hot Plugging, 3.3 & 1.5 V, ECC supported | 2.13 GB/s
PCI-X 533 | 64 bits, optional 16 bits only | 133 MHz Quad Data Rate | Hot Plugging, 3.3 & 1.5 V, ECC supported | 4.26 GB/s

Page 483: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

53  

Bandwidth Comparisons CONNECTION BITS BYTES

PCI 32-bit/33 MHz 1.06666 Gbit/s 133.33 MB/s

PCI 64-bit/33 MHz 2.13333 Gbit/s 266.66 MB/s

PCI 32-bit/66 MHz 2.13333 Gbit/s 266.66 MB/s

PCI 64-bit/66 MHz 4.26666 Gbit/s 533.33 MB/s

PCI 64-bit/100 MHz 6.39999 Gbit/s 799.99 MB/s

PCI Express (x1 link)[6] 2.5 Gbit/s 250 MB/s

PCI Express (x4 link)[6] 10 Gbit/s 1 GB/s

PCI Express (x8 link)[6] 20 Gbit/s 2 GB/s
PCI Express (x16 link)[6] 40 Gbit/s 4 GB/s

PCI Express 2.0 (x32 link)[6] 80 Gbit/s 8 GB/s

PCI-X DDR 16-bit 4.26666 Gbit/s 533.33 MB/s

PCI-X 133 8.53333 Gbit/s 1.06666 GB/s

PCI-X QDR 16-bit 8.53333 Gbit/s 1.06666 GB/s

PCI-X DDR 17.066 Gbit/s 2.133 GB/s

PCI-X QDR 34.133 Gbit/s 4.266 GB/s

AGP 8x 17.066 Gbit/s 2.133 GB/s

Page 484: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

54  

HyperTransport : Context

•  Northbridge-Southbridge device connection facilitates communication over fast processor bus between system memory, graphics adaptor, CPU

•  The Southbridge operates several I/O interfaces and communicates with the Northbridge over another proprietary connection

•  This approach is potentially limited by the emerging bandwidth demands over inadequate I/O buses

•  HyperTransport is one of the many technologies aimed at improving I/O.

•  High data rates are achieved by using enhanced, low-swing, 1.2 V Low Voltage Differential Signaling (LVDS) that employs fewer pins and wires consequently reducing cost and power requirements.

•  HyperTransport also helps in communication between multiple AMD Opteron CPUs

http://www.amd.com/us-en/Processors/ComputingSolutions/0,,30_288_13265_13295%5E13340,00.html

Page 485: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

55  

Hyper-Transport (continued) •  Point-to-point parallel topology uses 2

unidirectional links (one each for upstream and downstream)

•  HyperTransport technology chunks data into packets to reduce overhead and improve efficiency of transfers.

•  Each HyperTransport technology link also contains 8-bit data path that allows for insertion of a control packet in the middle of a long data packet, thus reducing latency.

•  In Summary : “HyperTransport™ technology delivers the raw throughput and low latency necessary for chip-to-chip communication. It increases I/O bandwidth, cuts down the number of different system buses, reduces power consumption, provides a flexible, modular bridge architecture, and ensures compatibility with PCI. “

http://www.amd.com/us-en/Processors/ComputingSolutions /0,,30_288_13265_13295%5E13340,00.html

Page 486: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

56  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 487: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

57  

Performance Issues

•  Cache behavior –  Hit/miss rate –  Replacement strategies

•  Prefetching •  Clock rate •  ILP •  Branch prediction •  Memory

–  Access time –  Bandwidth

Page 488: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

58  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 489: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

59  

Summary – Material for the Test

•  Please make sure that you have addressed all points outlined on slide 5

•  Understand content on slide 7 •  Understand concepts, equations, problems on

slides 11, 12, 13 •  Understand content on 21, 24, 26, 29 •  Understand concepts on slides 32,33,34,35,36 •  Understand content on slides 39, 57

•  Required reading material :

http://arstechnica.com/articles/paedia/hardware/pcie.ars/1

Page 490: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

60  

Page 491: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS

Pthreads

Prof. Thomas Sterling Department of Computer Science Louisiana State University February 22, 2011

Page 492: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

2

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 493: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

3

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 494: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Opening Remarks

•  We now have a good picture of supercomputer architecture –  including SMP structures

•  which are the building blocks of most HPC systems on the Top-500 List

•  We were introduced to the first two programming methods for exploiting parallelism –  Capacity Computing - Condor –  Co-operative Computing - MPI

•  Now we explore a 3rd programming model: multithreaded computing on shared memory systems –  This time: general principles and POSIX Pthreads –  Next time: OpenMP

4

Page 495: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

What you’ll Need to Know

•  Modeling time to execution with CPI •  Multi-thread programming and execution concepts

–  Parallelism with multiple threads –  Synchronization –  Memory consistency models

•  Basic Pthread commands •  Dangers

–  Race conditions –  Deadlock

5

Page 496: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

6

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 497: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

7

CPI

rate miss cache nsinstructiomemory for timeexecution

nsinstructioregister for timeexecution timeexecution

timecycle nsinstructiomemory executed ofnumber

nsinstructioregister executed ofnumber nsinstructio executed ofnumber

penalty) (miss miss cache with operationsmemory for cpi hit cache with operationsmemory for cpi

operationsmemory for cpi operationsregister for cpi

n instructioper cycles

miss

M

R

c

M

R

Mmiss

Mhit

M

R

rTTTt

I#I#I#

cpicpicpicpicpi

Page 498: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

8

CPI (continued)

\[
T = \#I \times cpi \times t_c
\]
\[
m_R = \frac{\#I_R}{\#I}, \qquad m_M = \frac{\#I_M}{\#I}, \qquad m_R + m_M = 1.0
\]
\[
cpi = m_R\,cpi_R + m_M\,cpi_M
\]
\[
cpi_M = (1 - r_{miss})\,cpi_{M\text{-}hit} + r_{miss}\,cpi_{M\text{-}miss}
\]
\[
T = \#I \times \left[ m_R\,cpi_R + m_M\left( (1 - r_{miss})\,cpi_{M\text{-}hit} + r_{miss}\,cpi_{M\text{-}miss} \right) \right] \times t_c
\]

Page 499: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

An Example

Robert hates parallel computing and runs all of his jobs on a single processor core on his Acme computer. His current application plays solitaire because he is too lazy to flip the cards himself. The machine he is running on has a 2 GHz clock. For this problem the basic register operations make up only 75% of the instruction mix but deliver one and a half instructions per cycle, while the load and store operations yield one per cycle. But his cache hit rate is only 80%, and the average penalty for not finding data in the L1 cache is 120 nanoseconds. A counter on the Acme processor tells Robert that it takes approximately 16 billion instruction executions to run his short program. How long does it take to execute Robert's application?

9

Page 500: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

And the answer is …

\[
\#I = 16{,}000{,}000{,}000 = 1.6 \times 10^{10}
\]
\[
\text{clock rate} = 2.0\ \text{GHz} \Rightarrow t_c = 0.5\ \text{ns}
\]
\[
r_{hit} = 0.8 \Rightarrow r_{miss} = 1 - r_{hit} = 0.2
\]
\[
cpi_R = 2/3 \ (\text{1.5 instructions per cycle}), \qquad cpi_{M\text{-}hit} = 1
\]
\[
cpi_{M\text{-}miss} = 2\ \text{cycles/ns} \times 120\ \text{ns} = 240\ \text{cycles}
\]
\[
m_R = 0.75, \qquad m_M = 0.25
\]
\[
T = 1.6 \times 10^{10} \times \left[ 0.75 \times \tfrac{2}{3} + 0.25 \times (0.8 \times 1 + 0.2 \times 240) \right] \times 0.5 \times 10^{-9}\ \text{s}
\]
\[
  = 1.6 \times 10^{10} \times (0.5 + 12.2) \times 0.5 \times 10^{-9}\ \text{s}
  = 1.6 \times 10^{10} \times 12.7 \times 0.5 \times 10^{-9}\ \text{s}
  = 101.6\ \text{seconds}
\]

10

Page 501: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

11

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 502: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

UNIX Processes vs. Multithreaded Programs

12

(Diagram comparing three cases:
- Standard UNIX process (single-threaded): one address space holding text, global data, a stack, execution state, and a PID.
- New process spawned via fork(): PID1's address space is copied in full for the child PID2, each with its own text, global data, stack, and execution state.
- Multithreaded application: a single address space (one PID, one text segment, shared data); each thread created via thread create has its own stack, execution state, and private data.)

Page 503: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

13

Anatomy of a Thread

A thread (or, more precisely: thread of execution) is typically described as a lightweight process. There are, however, significant differences in the way standard processes and threads are created, how they interact, and how they access resources. Many aspects of these are implementation dependent.

The private state of a thread includes:
•  Execution state (instruction pointer, registers)
•  Stack
•  Private variables (typically allocated on the thread's stack)

Threads share access to global data in the application's address space.

Page 504: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

14

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 505: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

15

Race Conditions

Example: consider the following piece of pseudo-code to be executed concurrently by threads T1 and T2 (the initial value of memory location A is x)

A→R: read memory location A into register R R++: increment register R A←R: write R into memory location A

Scenario  1:    Step  1)  T1:(A→R)  →  T1:R=x  Step  2)  T1:(R++)  →  T1:R=x+1  Step  3)  T1:(A←R)  →  T1:A=x+1  Step  4)  T2:(A→R)  →  T2:R=x+1  Step  5)  T2:(R++)  →  T2:R=x+2  Step  6)  T2:(A←R)  →  T2:A=x+2  

Scenario  2:    Step  1)  T1:(A→R)  →  T1:R=x  Step  2)  T2:(A→R)  →  T2:R=x  Step  3)  T1:(R++)  →  T1:R=x+1  Step  4)  T2:(R++)  →  T2:R=x+1  Step  5)  T1:(A←R)  →  T1:A=x+1  Step  6)  T2:(A←R)  →  T2:A=x+1  

Since threads are scheduled arbitrarily by an external entity, the lack of explicit synchronization may cause different outcomes.

Race condition (or race hazard) is a flaw in system or process whereby the output of the system or process is unexpectedly and critically dependent on the sequence or timing of other events.

Suggested reading: http://en.wikipedia.org/wiki/Race_condition
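A minimal Pthreads sketch (not from the slides) that exhibits exactly this hazard: two threads perform the unsynchronized read-modify-write above on a shared counter, so the final value is usually less than the expected 2000000.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;             /* shared memory location A */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                   /* A->R, R++, A<-R without synchronization */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}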

Page 506: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Critical Sections

16

Critical section is a segment of code accessing a shared resource (data structure or device) that must not be concurrently accessed by more than one thread of execution.

Suggested reading: http://en.wikipedia.org/wiki/Critical_section

critical section

The implementation of a critical section must prevent any change of processor control once execution enters the critical section.

•  Code on uniprocessor systems may rely on disabling interrupts and avoiding system calls leading to context switches, restoring the interrupt mask to the previous state upon exit from the critical section

•  General solutions rely on synchronization mechanisms (hardware-assisted when possible), discussed on the next slides

Page 507: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Thread Synchronization Mechanisms

•  Based on atomic memory operation (require hardware support) –  Spinlocks –  Mutexes (and condition variables) –  Semaphores –  Derived constructs: monitors, rendezvous, mailboxes, etc.

•  Shared memory based locking –  Dekker’s algorithm

http://en.wikipedia.org/wiki/Dekker%27s_algorithm

–  Peterson’s algorithm http://en.wikipedia.org/wiki/Peterson%27s_algorithm

–  Lamport’s algorithm http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm http://research.microsoft.com/users/lamport/pubs/bakery.pdf

17

Page 508: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Spinlocks

•  Spinlock is the simplest kind of lock, where a thread waiting for the lock to become available repeatedly checks lock’s status

•  Since the thread remains active, but doesn’t perform a useful computation, such a lock is essentially busy-waiting, and hence generally wasteful

•  Spinlocks are desirable in some scenarios: –  If the waiting time is short, spinlocks save the overhead and cost of context

switches, required if other threads have to be scheduled instead –  In real-time system applications, spinlocks offer good and predictable

response time

•  Typically use fair scheduling of threads to work correctly •  Spinlock implementations require atomic hardware primitives,

such as test-and-set, fetch-and-add, compare-and-swap, etc.

18

Suggested reading: http://en.wikipedia.org/wiki/Spinlock
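A minimal spinlock sketch, assuming the GCC/Clang __sync atomic builtins as the test-and-set primitive (production code would more likely use pthread_spin_lock or C11 atomics):

typedef volatile int spinlock_t;     /* 0 = free, 1 = held */

static void spin_lock(spinlock_t *lock)
{
    /* test-and-set: atomically write 1 and return the previous value;
       keep trying while the previous value was already 1 (lock held) */
    while (__sync_lock_test_and_set(lock, 1))
        ;                            /* busy-wait */
}

static void spin_unlock(spinlock_t *lock)
{
    __sync_lock_release(lock);       /* atomically clear the flag */
}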

Page 509: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Mutexes

•  Mutex (abbreviation for mutual exclusion) is an algorithm used to prevent concurrent accesses to a common resource. The name also applies to the program object which negotiates access to that resource.

•  Mutex works by atomically setting an internal flag when a thread (mutex owner) enters a critical section of the code. As long as the flag is set, no other threads are permitted to enter the section. When the mutex owner completes operations within the critical section, the flag is (atomically) cleared.

19

Suggested reading: http://en.wikipedia.org/wiki/Mutex

lock(mutex) critical section unlock(mutex)
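The same lock / critical section / unlock pattern expressed with the Pthreads mutex API (a minimal sketch; the shared counter is only an illustrative resource):

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;

void *safe_increment(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);       /* enter the critical section */
    shared_counter++;                /* protected update */
    pthread_mutex_unlock(&lock);     /* leave the critical section */
    return NULL;
}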

Page 510: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Condition Variables •  Condition variables are frequently used in association with mutexes to increase

the efficiency of execution in multithreaded environments •  Typical use involves a thread or threads waiting for a certain condition (based on

the values of variables inside the critical section) to occur. Note that: –  The thread cannot wait inside the critical section, since no other thread would be

permitted to enter and modify the variables –  The thread could monitor the values by repeatedly accessing the critical section

through its mutex; such a solution is typically very wasteful •  Condition variable permits the waiting thread to temporarily release the mutex it

owns, and provide the means for other threads to communicate the state change within the critical section to the waiting thread (if such a change occurred)

20

/* waiting thread code: */ lock(mutex); /* check if you can progress */ while (condition not true) wait(cond_var); /* now you can; do your work */ ... unlock(mutex);

/* modifying thread code: */ lock(mutex); /* update critical section variables */ ... /* announce state change */ signal(cond_var); unlock(mutex);
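In Pthreads the waiting/modifying pattern above looks roughly like this (a sketch; the flag ready stands in for whatever condition the waiter needs):

#include <pthread.h>

static pthread_mutex_t m    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int ready = 0;                 /* state guarded by the mutex */

void waiter(void)
{
    pthread_mutex_lock(&m);
    while (!ready)                    /* re-check the condition after every wakeup */
        pthread_cond_wait(&cond, &m); /* atomically releases m while waiting */
    /* condition holds: do the work */
    pthread_mutex_unlock(&m);
}

void modifier(void)
{
    pthread_mutex_lock(&m);
    ready = 1;                        /* update critical-section state */
    pthread_cond_signal(&cond);       /* announce the state change */
    pthread_mutex_unlock(&m);
}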

Page 511: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Semaphores •  Semaphore is a protected variable introduced by Edsger Dijkstra (in the “THE”

operating system) and constitutes the classic method for restricting access to a shared resource

•  It is associated with an integer variable (semaphore’s value) and a queue of waiting threads

•  Semaphore can be accessed only via the atomic P and V primitives:

•  Usage: –  Semaphore’s value S.v is initialized to a positive number –  Semaphore’s queue S.q is initially empty –  Entrance to critical section is guarded by P(S) –  When exiting critical section, V(S) is invoked –  Note: mutex can be implemented as a binary semaphore

21

P(semaphore S) {
  if S.v > 0 then
    S.v := S.v - 1;
  else {
    insert current thread in S.q;
    change its state to blocked;
    schedule another thread;
  }
}

V(semaphore S) {
  if S.v = 0 and not empty(S.q) then {
    pick a thread T from S.q;
    change T’s state to ready;
  } else
    S.v := S.v + 1;
}
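For comparison (not part of the slides), the POSIX semaphore API in <semaphore.h> exposes these primitives directly: sem_wait() plays the role of P and sem_post() the role of V.

#include <semaphore.h>

sem_t S;
...
sem_init(&S, 0, 1);    /* S.v initialized to 1 (binary semaphore) */
...
sem_wait(&S);          /* P(S): decrement the value, or block while it is 0 */
/* critical section */
sem_post(&S);          /* V(S): increment the value, waking a blocked thread if any */
...
sem_destroy(&S);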

Suggested reading: http://www.mcs.drexel.edu/~shartley/OSusingSR/semaphores.html http://en.wikipedia.org/wiki/Semaphore_(programming)

Page 512: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Disadvantages of Locks

•  Blocking mechanism (forces threads to wait)
•  Conservative (lock has to be acquired when there’s only a possibility of access conflict)
•  Vulnerable to faults and failures (what if the owner of the lock dies?)
•  Programming is difficult and error prone (deadlocks, starvation)
•  Does not scale with problem size and complexity
•  Require balancing the granularity of locked data against the cost of fine-grain locks
•  Not composable
•  Suffer from priority inversion and convoying
•  Difficult to debug

22

Reference: http://en.wikipedia.org/wiki/Lock_(computer_science)

Page 513: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

23

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 514: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

24

Shared Memory Consistency Model

•  Defines memory functionality related to read and write operations by multiple processors
   –  Determines the order of read values in response to the order of write values by multiple processors
   –  Enables the writing of correct, efficient, and repeatable shared memory programs
•  Establishes a formal discipline that places restrictions on the values that can be returned by a read in a shared-memory program execution
   –  Avoids non-determinacy in memory behavior
   –  Provides a programmer perspective on expected behavior
   –  Imposes demands on system memory operation
•  Two general classes of consistency models:
   –  Sequential consistency
   –  Relaxed consistency

Page 515: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

25

Sequential Consistency Model

•  Most widely adopted memory model
•  Required:
   –  Maintaining program order among operations from individual processors
   –  Maintaining a single sequential order among operations from all processors
•  Enforces the effect of atomic complex memory operations
   –  Enables compound atomic operations
   –  Avoids race conditions
   –  Precludes non-determinacy from dueling processors

Page 516: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

26

Relaxed Consistency Models

•  Sequential consistency over-constrains parallel execution, limiting parallel performance and scalability
   –  Critical sections impose sequential bottlenecks
   –  Amdahl’s Law applies, imposing an upper bound on performance
•  Relaxed consistency models permit optimizations not possible under the limitations of sequential consistency
•  Forms of relaxed consistency
   –  Program order
      •  Write to read
      •  Write to write
      •  Read to following read or write
   –  Write atomicity
      •  Read value of its own previous write prior to it being visible to all other processors

Page 517: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

27

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 518: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Dining Philosophers Problem

28

Description:
•  N philosophers (N > 3) spend their time eating and thinking at a round table
•  There are N plates and N forks (or chopsticks, in some versions) between the plates
•  Eating requires two forks, which may be picked up one at a time, at each side of the plate
•  When any of the philosophers is done eating, he starts thinking
•  When a philosopher becomes hungry, he attempts to start eating
•  They do it in complete silence so as not to disturb each other (hence no communication to synchronize their actions is possible)

A variation on Edsger Dijkstra’s five computers competing for access to five shared tape drives problem (introduced in 1971), retold by Tony Hoare.

Problem: How must they acquire/release forks to ensure that each of them maintains a healthy balance between meditation and eating?
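One possible deadlock-free strategy is resource ordering, sketched below with Pthreads mutexes standing in for forks (illustrative code, not from the slides): philosopher i uses forks i and (i+1)%N and always locks the lower-numbered fork first, which breaks the circular wait.

#include <pthread.h>

#define N 5
pthread_mutex_t fork_mtx[N];    /* one mutex per fork, each initialized with
                                   pthread_mutex_init() before the threads start */

void *philosopher(void *arg)
{
    int i      = *(int *)arg;
    int right  = (i + 1) % N;
    int first  = (i < right) ? i : right;    /* lower-numbered fork */
    int second = (i < right) ? right : i;    /* higher-numbered fork */

    for (;;) {
        /* think ... */
        pthread_mutex_lock(&fork_mtx[first]);    /* lock in the global order */
        pthread_mutex_lock(&fork_mtx[second]);
        /* eat ... */
        pthread_mutex_unlock(&fork_mtx[second]);
        pthread_mutex_unlock(&fork_mtx[first]);
    }
    return NULL;
}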

Page 519: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

What Can Go Wrong at the Philosophers Table?

•  Deadlock If all philosophers decide to eat at the same time and pick forks at the same side of their plates, they are stuck forever waiting for the second fork.

•  Livelock Livelock frequently occurs as a consequence of a poorly thought out deadlock prevention strategy. Assume that all philosophers: (a) wait some length of time before putting down the fork they hold after noticing that they are unable to acquire the second fork, and then (b) wait some amount of time before reacquiring the forks. If they happen to get hungry at the same time and each picks up one fork as in the deadlock scenario, and all (a) and (b) timeouts are set to the same value, they won’t be able to make progress (even though there is no actual resource shortage).

•  Starvation There may be at least one philosopher unable to acquire both forks due to timing issues. For example, his neighbors may alternately keep picking up one of his forks just ahead of him, taking advantage of the fact that he is forced to put down the only fork he was able to get hold of due to the deadlock avoidance mechanism.

29

Page 520: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

30

Priority Inversion

Priority inversion is the scenario where a low priority thread holds a shared resource that is required by a high priority thread.

•  How it happens:
   –  A low priority thread locks the mutex for some shared resource
   –  A high priority thread requires access to the same resource (and waits for the mutex)
   –  In the meantime, a medium priority thread (not depending on the common resource) gets scheduled, preempting the low priority thread and thus preventing it from releasing the mutex
•  A classic occurrence of this phenomenon led to a system reset and subsequent loss of data in the Mars Pathfinder mission in 1997: http://research.microsoft.com/~mbj/Mars_Pathfinder/Mars_Pathfinder.html
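Where the platform supports the POSIX priority-inheritance protocol, the mutex guarding the shared resource can be configured so that a low-priority owner temporarily inherits the priority of any higher-priority thread blocked on it, bounding the inversion. A sketch (illustrative code, not from the slides):

#include <pthread.h>

pthread_mutexattr_t attr;
pthread_mutex_t     res_mutex;
...
pthread_mutexattr_init(&attr);
pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);  /* enable priority inheritance */
pthread_mutex_init(&res_mutex, &attr);
pthread_mutexattr_destroy(&attr);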

Suggested reading: http://en.wikipedia.org/wiki/Priority_inversion

Page 521: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

31

Spurious Wakeups

•  Spurious wakeup is a phenomenon associated with a thread waiting on a condition variable

•  In most cases, such a thread is supposed to return from call to wait() only if the condition variable has been signaled or broadcast

•  Occasionally, the waiting thread gets unblocked unexpectedly, either due to thread implementation performance trade-offs, or scheduler deficiencies

•  Lesson: upon exit from wait(), test the predicate to make sure the waiting thread indeed may proceed (i.e., the data it was waiting for has been provided). The side effect is more robust code.
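In code, the lesson amounts to the following pattern (a sketch; data_ready, cond and mutex are illustrative names): re-testing the predicate in a while loop makes a spurious return from pthread_cond_wait() harmless.

pthread_mutex_lock(&mutex);
while (!data_ready)                    /* "while", not "if" */
    pthread_cond_wait(&cond, &mutex);
/* the predicate is guaranteed to hold here; consume the data */
pthread_mutex_unlock(&mutex);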

Suggested reading: http://en.wikipedia.org/wiki/Spurious_wakeup

Page 522: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Thread Safety Code is thread-safe if it functions correctly during simultaneous execution by multiple threads.

•  Indicators helpful in determining thread safety –  How the code accesses global variables and heap –  How it allocates and frees resources that have global limits –  How it performs indirect accesses (through pointers or handles) –  Are there any visible side effects

•  Achieving thread safety

–  Re-entrancy: property of code, which may be interrupted during execution of one task, reentered to perform another, and then resumed on its original task without undesirable effects

–  Mutual exclusion: accesses to shared data are serialized to ensure that only one thread performs critical state update. Acquire locks in an identical order on all threads

–  Thread-local storage: as much of the accessed data as possible should be placed in thread’s private variables

–  Atomic operations: should be the preferred mechanism of use when operating on shared state
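As an illustration of the thread-local storage approach listed above, here is a small sketch using the Pthreads thread-specific data API (pthread_key_create() and friends); the per-thread error counter is an invented example, not from the slides.

#include <pthread.h>
#include <stdlib.h>

static pthread_key_t  err_key;
static pthread_once_t err_once = PTHREAD_ONCE_INIT;

static void make_key(void) { pthread_key_create(&err_key, free); }

/* returns a counter private to the calling thread instead of a shared global */
int *thread_error_count(void)
{
    pthread_once(&err_once, make_key);           /* create the key exactly once */
    int *count = pthread_getspecific(err_key);
    if (count == NULL) {                         /* first use by this thread */
        count = calloc(1, sizeof *count);
        pthread_setspecific(err_key, count);
    }
    return count;
}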

32

Page 523: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

33

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 524: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Common Approaches to Thread Implementation

•  Kernel threads •  User-space threads •  Hybrid implementations

34

References: 1. POSIX Threads on HP-UX 11i, http://devresource.hp.com/drc/resources/pthread_wp_jul2004.pdf 2. SunOS Multi-thread Architecture by M. L. Powell, S. R. Kleinman, et al. http://opensolaris.org/os/project/muskoka/doc_attic/mt_arch.pdf

Page 525: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Kernel Threads

•  Also referred to as Light Weight Processes
•  Known to and individually managed by the kernel
•  Can make system calls independently
•  Can run in parallel on a multiprocessor (map directly onto available execution hardware)
•  Typically have wider range of scheduling capabilities
•  Support preemptive multithreading natively
•  Require kernel support and resources
•  Have higher management overhead

35

Page 526: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

User-space Threads

•  Also known as fibers or coroutines
•  Operate on top of kernel threads, mapped to them via a user-space scheduler
•  Thread manipulations (“context switches”, etc.) are performed entirely in user space
•  Usually scheduled cooperatively (i.e., non-preemptively), complicating the application code due to the inclusion of explicit processor yield statements
•  Context switches cost less (on the order of a subroutine invocation)
•  Consume fewer resources than kernel threads; their number can consequently be much higher without imposing significant overhead
•  Blocking system calls present a challenge and may lead to inefficient processor usage (the user-space scheduler is ignorant of the occurrence of blocking; no notification mechanism exists in the kernel either)

36

Page 527: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

MxN Threading

•  Available on NetBSD, HP-UX and Solaris to complement the existing 1x1 (kernel threads only) and Mx1 (multiplexed user threads) libraries

•  Multiplex M lightweight user-space threads on top of N kernel threads, M > N (sometimes M >> N)

•  User threads are unbound and scheduled on Virtual Processors (which in turn execute on kernel threads); user thread may effectively move from one kernel thread to another in its lifetime

•  In some implementations Virtual Processors rely on the concept of Scheduler Activations to deal with the issue of user-space threads blocking during system calls

37

Page 528: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

38

Scheduler Activations
•  Developed in 1991 at the University of Washington
•  Typically used in implementations involving user-space threads
•  Require kernel cooperation in the form of a lightweight upcall mechanism to communicate blocking and unblocking events to the user-space scheduler

Reference: T. Anderson, B. Bershad, E. Lazowska and H. Levy, Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism, http://www.cs.washington.edu/homes/bershad/Papers/p53-anderson.pdf

–  Unbound user threads are scheduled on Virtual Processors (which in turn execute on kernel threads) –  A user thread may effectively move from one kernel thread to another in its lifetime –  Scheduler Activation resembles and is scheduled like a kernel thread –  Scheduler Activation provides its replacement to the user-space scheduler when the unbound thread invokes a blocking operation in the kernel –  The new Scheduler Activation continues the operations of the same VP

Page 529: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

39

Examples of Multi-Threaded System Implementations

•  The most commonly used thread package on Linux is Native POSIX Thread Library (NPTL)

–  Requires kernel version 2.6 –  1x1 model, mapping each application thread to a kernel thread –  Bundled by default with recent versions of glibc –  High-performance implementation –  POSIX (Pthreads) compliant

•  Most of the prominent operating systems feature their own thread implementations, for example:

–  FreeBSD: three thread libraries, each supporting different execution model (user-space, 1x1, MxN with scheduler activations)

–  Solaris: kernel-level execution through LWPs (Lightweight Processes); user threads execute in context of LWPs and are controlled by system library

–  HPUX: Pthreads compliant MxN implementation
–  MS Windows: threads as smallest kernel-level execution objects, fibers as smallest user-level execution objects controlled by the programmer; many-to-many scheduling supported
•  There are numerous open-source thread libraries (mostly for Linux): LinuxThreads, GNU Pth, Bare-Bone Threads, FSU Pthreads, DCEthreads, Nthreads, CLthreads, PCthreads, LWP, QuickThreads, Marcel, etc.

Page 530: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

40

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 531: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

POSIX Threads (Pthreads)
•  POSIX Threads define the POSIX standard for a multithreaded API (IEEE POSIX 1003.1-1995)
•  The functions comprising the core functionality of Pthreads can be divided into three classes:
   –  Thread management
   –  Mutexes
   –  Condition variables
•  Pthreads define the interface using C language types, function prototypes and macros
•  Naming conventions for identifiers:
   –  pthread_: Threads themselves and miscellaneous subroutines
   –  pthread_attr_: Thread attributes objects
   –  pthread_mutex_: Mutexes
   –  pthread_mutexattr_: Mutex attributes objects
   –  pthread_cond_: Condition variables
   –  pthread_condattr_: Condition attributes objects
   –  pthread_key_: Thread-specific data keys

41

References: 1. http://www.llnl.gov/computing/tutorials/pthreads/ 2. http://www.opengroup.org/onlinepubs/007908799/xsh/pthread.h.html

Page 532: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Programming with Pthreads The scope of this short tutorial is: •  General thread management •  Synchronization

–  Mutexes –  Condition variables

•  Miscellaneous functions
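One practical note not covered on the slide: with the GNU toolchain (an assumption; other compilers use different flags), the examples that follow can typically be built as gcc -pthread -o program program.c; the -pthread flag sets both the required compile options and the link library.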

42

Page 533: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

43

Pthreads: Thread Creation

Function: pthread_create()

int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*routine)(void *), void *arg); Description: Creates a new thread within a process. The created thread starts execution of routine, which is passed a pointer argument arg. The attributes of the new thread can be specified through attr, or left at default values if attr is null. Successful call returns 0 and stores the id of the new thread in location pointed to by thread, otherwise an error code is returned.

#include <pthread.h>
...
void *do_work(void *input_data) { /* this is thread’s starting routine */ ... }
...
pthread_t id;
struct {. . .} args = {. . .};   /* struct containing thread arguments */
int err;
...
/* create new thread with default attributes */
err = pthread_create(&id, NULL, do_work, (void *)&args);
if (err != 0) {/* handle thread creation failure */}
...

Page 534: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

44

Pthreads: Thread Join

Function: pthread_join()

int pthread_join(pthread_t thread, void **value_ptr);

Description: Suspends the execution of the calling thread until the target thread terminates (either by returning from its startup routine, or calling pthread_exit()), unless the target thread already terminated. If value_ptr is not null, the return value from the target thread or argument passed to pthread_exit() is made available in location pointed to by value_ptr. When pthread_join() returns successfully (i.e. with zero return code), the target thread has been terminated.

#include <pthread.h> ... void *do_work(void *args) {/* workload to be executed by thread */} ... void *result_ptr; int err; ... /* create worker thread */ pthread_create(&id, NULL, do_work, (void *)&args); ... err = pthread_join(id, &result_ptr); if (err != 0) {/* handle join error */} else {/* the worker thread is terminated and result_ptr points to its return value */ ... }

Page 535: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

45

Pthreads: Thread Exit Function: pthread_exit()

void pthread_exit(void *value_ptr);

Description: Terminates the calling thread and makes the value_ptr available to any successful join with the terminating thread. Performs cleanup of local thread environment by calling cancellation handlers and data destructor functions. Thread termination does not release any application visible resources, such as mutexes and file descriptors, nor does it perform any process-level cleanup actions.

#include <pthread.h>
...
void *do_work(void *args) {
  ...
  pthread_exit(&return_value);
  /* the code following pthread_exit is not executed */
  ...
}
...
void *result_ptr;
pthread_t id;
pthread_create(&id, NULL, do_work, (void *)&args);
...
pthread_join(id, &result_ptr);
/* result_ptr now points to return_value */
...

Page 536: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

46

Pthreads: Thread Termination

Function: pthread_cancel()

int pthread_cancel(pthread_t thread);

Description: The pthread_cancel() requests cancellation of thread thread. The ability to cancel a thread is dependent on its state and type.

#include <pthread.h> ... void *do_work(void *args) {/* workload to be executed by thread */} ... pthread_t id; int err; pthread_create(&id, NULL, do_work, (void *)&args); ... err = pthread_cancel(id); if (err != 0) {/* handle cancelation failure */} ...

Page 537: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

47

Pthreads: Detached Threads

Function: pthread_detach()

int pthread_detach(pthread_t thread);

Description: Indicates to the implementation that storage for thread thread can be reclaimed when the thread terminates. If the thread has not terminated, pthread_detach() is not going to cause it to terminate. Returns zero on success, error number otherwise.

#include <pthread.h> ... void *do_work(void *args) {/* workload to be executed by thread */} ... pthread_t id; int err; ... /* start a new thread */ pthread_create(&id, NULL, do_work, (void *)&args); ... err = pthread_detach(id); if (err != 0) {/* handle detachment failure */} else {/* master thread doesn’t join the worker thread; the worker thread resources will be released automatically after it terminates */ ... }

Page 538: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

48

Pthreads: Operations on Mutex Objects (I)

#include <pthread.h> ... pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; ... /* lock the mutex before entering critical section */ pthread_mutex_lock(&mutex); /* critical section code */ ... /* leave critical section and release the mutex */ pthread_mutex_unlock(&mutex); ...

Function: pthread_mutex_lock(), pthread_mutex_unlock()

int pthread_mutex_lock(pthread_mutex_t *mutex); int pthread_mutex_unlock(pthread_mutex_t *mutex); Description: The mutex object referenced by mutex shall be locked by calling pthread_mutex_lock(). If the mutex is already locked, the calling thread blocks until the mutex becomes available. After successful return from the call, the mutex object referenced by mutex is in locked state with the calling thread as its owner. The mutex object referenced by mutex is released by calling pthread_mutex_unlock(). If there are threads blocked on the mutex, scheduling policy decides which of them shall acquire the released mutex.

Page 539: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

49

Pthreads: Operations on Mutex Objects (II)

#include <pthread.h> ... pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; int err; ... /* attempt to lock the mutex */ err = pthread_mutex_trylock(&mutex); switch (err) { case 0: /* lock acquired; execute critical section code and release mutex */ ... pthread_mutex_unlock(&mutex); break; case EBUSY: /* someone already owns the mutex; do something else instead of blocking */ ... break; default: /* some other failure */ ... break; }

Function: pthread_mutex_trylock()

int pthread_mutex_trylock(pthread_mutex_t *mutex);

Description: The function pthread_mutex_trylock() is equivalent to pthread_mutex_lock() , except that if the mutex object is currently locked, the call returns immediately with an error code EBUSY. The value of 0 (success) is returned only if the mutex has been acquired.

Page 540: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Pthread Mutex Types

•  Normal
   –  No deadlock detection on attempts to relock an already locked mutex
•  Error-checking
   –  An error is returned when locking an already locked mutex
•  Recursive (see the sketch after this list)
   –  Maintains a lock count variable
   –  After the first acquisition of the mutex, the lock count is set to one
   –  After each successful relock, the lock count is increased; after each unlock, it is decremented
   –  When the lock count drops to zero, the thread loses the mutex ownership
•  Default
   –  Attempts to lock the mutex recursively result in undefined behavior
   –  Attempts to unlock a mutex which is not locked, or was not locked by the calling thread, result in undefined behavior
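A brief sketch of how a non-default type is selected in Pthreads (illustrative code, not from the slides), using a mutex attributes object and pthread_mutexattr_settype():

#include <pthread.h>

pthread_mutexattr_t attr;
pthread_mutex_t     rec_mutex;
...
pthread_mutexattr_init(&attr);
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
pthread_mutex_init(&rec_mutex, &attr);
pthread_mutexattr_destroy(&attr);
...
pthread_mutex_lock(&rec_mutex);      /* lock count = 1 */
pthread_mutex_lock(&rec_mutex);      /* relock by the same thread: count = 2 */
pthread_mutex_unlock(&rec_mutex);    /* count = 1 */
pthread_mutex_unlock(&rec_mutex);    /* count = 0: ownership released */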

50

Page 541: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

51

Pthreads: Condition Variables

Function: pthread_cond_wait(), pthread_cond_signal(), pthread_cond_broadcast()

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_signal(pthread_cond_t *cond);
int pthread_cond_broadcast(pthread_cond_t *cond);

Description: pthread_cond_wait() blocks on a condition variable associated with a mutex. The function must be called with a locked mutex argument. It atomically releases the mutex and causes the calling thread to block. While in that state, another thread is permitted to acquire the mutex; the state change should subsequently be announced by that thread through pthread_cond_signal() or pthread_cond_broadcast(). Upon successful return from pthread_cond_wait(), the mutex is in locked state with the calling thread as its owner. pthread_cond_signal() unblocks at least one of the threads that are blocked on the specified condition variable cond. pthread_cond_broadcast() unblocks all threads currently blocked on the specified condition variable cond. All of these functions return zero on successful completion, or an error code otherwise.

Page 542: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

52

Example: Condition Variable

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;  /* create default mutex */
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;     /* create default condition variable */
pthread_t prod_id, cons_id;
item_t buffer;                                      /* storage buffer (shared access) */
int empty = 1;                                      /* buffer empty flag (shared access) */
...
pthread_create(&prod_id, NULL, producer, NULL);     /* start producer thread */
pthread_create(&cons_id, NULL, consumer, NULL);     /* start consumer thread */
...

void *producer(void *none) {
  while (1) {
    /* obtain next item, asynchronously */
    item_t item = compute_item();
    pthread_mutex_lock(&mutex);        /* critical section starts here */
    while (!empty)                     /* wait until buffer is empty */
      pthread_cond_wait(&cond, &mutex);
    /* store item, update status */
    buffer = item;
    empty = 0;
    /* wake waiting consumer (if any) */
    pthread_cond_signal(&cond);
    /* critical section done */
    pthread_mutex_unlock(&mutex);
  }
}

void *consumer(void *none) {
  while (1) {
    item_t item;
    pthread_mutex_lock(&mutex);        /* critical section starts here */
    while (empty)                      /* block (nothing in buffer yet) */
      pthread_cond_wait(&cond, &mutex);
    /* grab item, update buffer status */
    item = buffer;
    empty = 1;
    /* critical section done */
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mutex);
    /* process item, asynchronously */
    consume_item(item);
  }
}

Initialization and startup

Simple producer thread    Simple consumer thread

Page 543: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

53

Pthreads: Dynamic Initialization

Function: pthread_once()

int pthread_once(pthread_once_t *control, void (*init_routine)(void));

Description: The first call to pthread_once() by any thread in a process will call the init_routine() with no arguments. Subsequent calls to pthread_once() with the same control will not call init_routine().

#include <pthread.h>
...
pthread_once_t init_ctrl = PTHREAD_ONCE_INIT;
...
void initialize() {/* initialize global variables */}
...
void *do_work(void *arg) {
  /* make sure global environment is set up */
  pthread_once(&init_ctrl, initialize);
  /* start computations */
  ...
}
...
pthread_t id;
pthread_create(&id, NULL, do_work, NULL);
...

Page 544: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

54

Pthreads: Get Thread ID

Function: pthread_self()

pthread_t pthread_self(void);

Description: Returns the thread ID of the calling thread.

#include <pthread.h> ... pthread_t id; id = pthread_self(); ...

Page 545: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

55

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 546: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Summary – Material for the Test

•  Performance & cpi: slide 8 •  Multi thread concepts: 13, 16, 18, 19, 22, 24, 31 •  Thread implementations: 35 – 37 •  Pthreads: 43 – 45, 48

56  

Page 547: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

57  

Page 548: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &

MEANS

OPENMP

Prof. Thomas Sterling

Department of Computer Science

Louisiana State University

February 24, 2011

Page 549: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

2

Page 550: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

3

Page 551: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Where are we? (Take a deep breath …)

• 3 classes of parallel/distributed computing

– Capacity

– Capability

– Cooperative

• 3 classes of parallel architectures (respectively)

– Loosely coupled clusters and workstation farms

– Tightly coupled vector, SIMD, SMP

– Distributed memory MPPs (and some clusters)

• 3 classes of parallel execution models (respectively)

– Workflow, throughput, SPMD (ssh)

– Multithreaded with shared memory semantics (Pthreads)

– Communicating Sequential Processes (sockets)

• 3 classes of programming models

– Condor (Segment 1)

– OpenMP (Segment 3)

– MPI (Segment 2)

You Are Here

4

Page 552: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

HPC Modalities

5

Modalities Degree of Integration

Architectures Execution Models

Programming Models

Capacity Loosely Coupled Clusters & Workstation farms

Workflow Throughput

Condor

Capability Tightly Coupled Vectors, SMP, SIMD

Shared Memory Multithreading

OpenMP

Cooperative Medium DM MPPs & Clusters

CSP MPI

Page 553: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

6

Page 554: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

7

Amdahl’s Law

Definitions:

   T_O : time for the non-accelerated computation
   T_F : time of the portion of the computation that can be accelerated
   T_A : time for the accelerated computation
   g   : peak performance gain for the accelerated portion of the computation
   f   : fraction of the non-accelerated computation to be accelerated, f = T_F / T_O
   S   : speed up of the computation with acceleration applied, S = T_O / T_A

   T_A = T_O - T_F + T_F / g = T_O * (1 - f + f / g)

   S = T_O / T_A = 1 / (1 - f + f / g)

(Timeline figure: the original run of length T_O contains the accelerable portion T_F; in the accelerated run of length T_A that portion shrinks to T_F / g.)
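As a quick worked example (not on the original slide): if 90% of the computation can be accelerated (f = 0.9) by a peak factor of g = 10, then S = 1 / (0.1 + 0.9/10) = 1 / 0.19 ≈ 5.3, well below the factor of 10 applied to the accelerated portion.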

Page 555: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Performance : Caches & Locality

• Temporal Locality is a property that if a program accesses a

memory location, there is a much higher than random probability

that the same location would be accessed again.

• Spatial Locality is a property that if a program accesses a

memory location, there is a much higher than random probability

that the nearby locations would be accessed soon.

• Spatial locality is usually easier to achieve than temporal locality

• A couple of key factors affect the relationship between locality

and scheduling :

– Size of dataset being processed by each processor

– How much reuse is present in the code processing a chunk of

iterations.

8

Page 556: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Performance Shared Memory (OpenMP): Key

Factors

• Load Balancing :

– mapping workloads with thread scheduling

• Caches :

– Write-through

– Write-back

• Locality :

– Temporal Locality

– Spatial Locality

• How Locality affects scheduling algorithm selection

• Synchronization :

– Effect of critical sections on performance

9

Page 557: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Performance : Caches & Locality

• Caches (Review) :

– for a C statement :

• a[i] = b[i]+c[i]

– the system fetches the contents of the memory locations referenced by b[i] and c[i] into the processor; the result of the computation is subsequently stored in the memory location referenced by a[i]

• Write-through caches: When a user writes some data, the data is immediately

written back to the memory, thus maintaining the cache-memory consistency.

In write through caches data in caches always reflect the data in the memory.

One of the main issues in write through caches is the increase in system

overhead required due to moving of large data between cache and memory.

• Write-back caches : When a user writes some data, the data is stored in the

cache and is not synchronized with the memory. Instead when the cache

content is different than the memory content, a bit entry is made in the cache.

While cleaning up caches the system checks for the entry in cache and if the

bit is set the system writes the changes to the memory.

10

Page 558: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

11

Page 559: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Introduction

• OpenMP is :– an API (Application Programming Interface)

– NOT a programming language

– A set of compiler directives that help the application developer to parallelize their workload.

– A collection of the directives, environment variables and the library routines

• OpenMP is composed of the following main components : – Directives

– Runtime library routines

– Environment variables

12

Page 560: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Components of OpenMP

Environment variables

Number of threads

Scheduling type

Dynamic thread adjustment

Nested Parallelism

13

Directives

Parallel regions

Work sharing

Synchronization

Data scope attributes: • private • firstprivate • lastprivate • shared • reduction

Orphaning

Runtime library routines

Number of threads

Thread ID

Dynamic thread adjustment

Nested Parallelism

Timers

API for locking

Page 561: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP Architecture

14

Operating System level Threads

OpenMP Runtime Library

Application

Environment Variables

User

Compiler Directives

Inspired by OpenMP.org introductory slides

Page 562: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

15

Page 563: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Runtime Library Routines

• Runtime library routines help manage parallel programs

• Many runtime library routines have corresponding environment

variables that can be controlled by the users

• Runtime libraries can be accessed by including omp.h in

applications that use OpenMP : #include <omp.h>

• For example for calls like :

– omp_get_num_threads(), (by which an openMP program determines

the number of threads available for execution) can be controlled using

an environment variable set at the command-line of a shell

($OMP_NUM_THREADS)

• Some of the activities that the OpenMP libraries help manage are :

– Determining the number of threads/processors

– Scheduling policies to be used

– General purpose locking and portable wall clock timing routines

16

Page 564: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

17

OpenMP : Runtime Library

Function: omp_get_num_threads()

C/ C++ int omp_get_num_threads(void);

Fortran integer function omp_get_num_threads()

Description:

Returns the total number of threads currently in the group executing the parallel

block from where it is called.

Function: omp_get_thread_num()

C/ C++ int omp_get_thread_num(void);

Fortran integer function omp_get_thread_num()

Description:

For the master thread, this function returns zero. For the other threads in the team, the call returns an integer between 1 and omp_get_num_threads()-1 inclusive.

Page 565: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP Environment Variables

• OpenMP provides 4 main environment variables for

controlling execution of parallel codes:

OMP_NUM_THREADS – controls the parallelism of the

OpenMP application

OMP_DYNAMIC – enables dynamic adjustment of number of

threads for execution of parallel regions

OMP_SCHEDULE – controls the load distribution in loops such

as do, for

OMP_NESTED – Enables nested parallelism in OpenMP

applications

18

Page 566: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP Environment Variables

19

Environment

Variable:

OMP_NUM_THREADS

Usage :

bash/sh/ksh:

csh/tcsh

OMP_NUM_THREADS n

export OMP_NUM_THREADS=8

setenv OMP_NUM_THREADS 8

Description:

Sets the number of threads to be used by the OpenMP program during execution.

Environment

Variable:

OMP_DYNAMIC

Usage :

bash/sh/ksh:

csh/tcsh

OMP_DYNAMIC {TRUE|FALSE}

export OMP_DYNAMIC=TRUE

setenv OMP_DYNAMIC "TRUE"

Description:

When this environment variable is set to TRUE, the runtime may dynamically adjust the number of threads used for parallel regions, with the maximum number of threads given by $OMP_NUM_THREADS.

Page 567: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP Environment Variables

20

Environment

Variable:

OMP_SCHEDULE

Usage :

bash/sh/ksh:

csh/tcsh

OMP_SCHEDULE “schedule,[chunk]”

export OMP_SCHEDULE="static,N/P"

setenv OMP_SCHEDULE "GUIDED,4"

Description:

Only applies to for and parallel for directives. This environment variable sets the

schedule type and chunk size for all such loops. The chunk size can be provided as an

integer number, the default being 1.

Environment

Variable:

OMP_NESTED

Usage :

bash/sh/ksh:

csh/tcsh

OMP_NESTED {TRUE|FALSE}

export OMP_NESTED=FALSE

setenv OMP_NESTED FALSE

Description:

Setting this environment variable to TRUE enables multi-threaded execution of inner

parallel regions in nested parallel regions.

Page 568: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP : Basic Constructs

C / C++ :

#pragma omp parallel {

parallel block

} /* omp end parallel */

21

OpenMP Execution Model (FORK/JOIN):

Sequential Part (master thread)

Parallel Region (FORK : group of threads)

Sequential Part (JOIN: master thread)

Parallel Region (FORK: group of threads)

Sequential Part (JOIN : master thread)

To invoke library routines in C/C++ add

#include <omp.h> near the top of your code

Page 569: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

HelloWorld in OpenMP

22

#include <omp.h>
#include <stdio.h>   /* needed for printf() */

int main () {
  int nthreads, tid;

#pragma omp parallel private(nthreads, tid)
  {
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  }
  return 0;
}

Code segment that will be executed in parallel

OpenMP directive to indicate START segment to be parallelized

OpenMP directive to indicate END segment to be parallelized

Non shared copies of data for each thread

Page 570: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP Execution

• On encountering the C construct #pragma omp parallel{, n-1 extra threads are created

• omp_get_thread_num() returns a unique identifier for each thread that can be utilized. The value returned by this call is between 0 and (OMP_NUM_THREADS – 1)

• omp_get_num_threads() returns the total number of threads involved in the parallel section of the program

• Code after the parallel directive is executed independently on each of the nthreads.

• On encountering the closing brace } (corresponding to #pragma omp parallel {), parallel execution of the code segment ends: the n-1 extra threads are deactivated and normal sequential execution resumes.

23

Page 571: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Compiling OpenMP Programs

Fortran :

• Case insensitive directives

• Syntax :

– !$OMP directive [clause[[,] clause]…] (free format)

– !$OMP / C$OMP / *$OMP directive [clause[[,] clause]…] (fixed format)

• Compiling OpenMP source code :

– (GNU Fortran compiler) : gfortran -fopenmp -o exec_name file_name.f95

– (Intel Fortran compiler) : ifort -o exe_file_name -openmp file_name.f

24

C :

• Case sensitive directives

• Syntax :

– #pragma omp directive [clause [clause]..]

• Compiling OpenMP source code :

– (GNU C compiler) : gcc -fopenmp -o exec_name file_name.c

– (Intel C compiler) : icc -o exe_file_name -openmp file_name.c

Page 572: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

DEMO : Hello World

25

Page 573: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

26

Page 574: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP : Data Environment

• OpenMP program always begins with a single thread of control – master

thread

• Context associated with the master thread is also known as the Data

Environment.

• Context is comprised of :

– Global variables

– Automatic variables

– Dynamically allocated variables

• Context of the master thread remains valid throughout the execution of the

program

• The OpenMP parallel construct may be used to either share a single copy of

the context with all the threads or provide each of the threads with a private

copy of the context.

• The sharing of Context can be performed at various levels of granularity

– Select variables from a context can be shared while keeping the context private

etc.

27

Page 575: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP Data Environment

• OpenMP data scoping clauses allow a programmer to decide a variable’s execution context (should a variable be shared or private.)

• 3 main data scoping clauses in OpenMP (Shared, Private, Reduction) :

• Shared :

– A variable will have a single storage location in memory for the duration of the parallel construct, i.e. references to a variable by different threads access the same memory location.

– That part of the memory is shared among the threads involved, hence modifications to the variable can be made using simple read/write operations

– Modifications to the variable by different threads is managed by underlying shared memory mechanisms

• Private :

– A variable will have a separate storage location in memory for each of the threads involved for the duration of the parallel construct.

– All read/write operations by the thread will affect the thread’s private copy of the variable .

• Reduction :

– Exhibit both shared and private storage behavior. Usually used on objects that are the target of arithmetic reduction.

– Example : summation of local variables at the end of a parallel construct

28

Page 576: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP Work-Sharing Directives• Work sharing constructs divide the execution of the

enclosed block of code among the group of threads.

• They do not launch new threads.

• No implied barrier on entry

• Implicit barrier at the end of work-sharing construct

• Commonly used Work Sharing constructs :

– for directive (C/C++ ; equivalent DO construct available in

Fortran but will not be covered here) : shares iterations of a

loop across a group of threads

– sections directive : breaks work into separate sections

between the group of threads; such that each thread

independently executes a section of the work.

– critical directive: serializes a section of code

29

Page 577: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP schedule clause

• The schedule clause defines how the iterations of a loop are divided

among a group of threads

• static : iterations are divided into pieces of size chunk and are

statically assigned to each of the threads in a round robin fashion

• dynamic : iterations divided into pieces of size chunk and

dynamically assigned to a group of threads. After a thread finishes

processing a chunk, it is dynamically assigned the next set of

iterations.

• guided : for a chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads, decreasing to 1. For a chunk size of k, the same algorithm is used for determining the chunk size, with the constraint that no chunk contains fewer than k iterations (except possibly the last chunk). An example of all three schedule kinds is sketched after this list.

• Default schedule is implementation specific while the default chunk

size is usually 1
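A small sketch of how these clauses look in code (not from the slides; a, b, c, n and the chunk size of 4 are illustrative):

/* the same loop under the three schedule kinds, each with a chunk size of 4 */
#pragma omp parallel for schedule(static, 4)
for (i = 0; i < n; i++) c[i] = a[i] + b[i];   /* fixed blocks of 4, assigned round robin */

#pragma omp parallel for schedule(dynamic, 4)
for (i = 0; i < n; i++) c[i] = a[i] + b[i];   /* blocks of 4 handed out on demand */

#pragma omp parallel for schedule(guided, 4)
for (i = 0; i < n; i++) c[i] = a[i] + b[i];   /* shrinking blocks, never below 4 (except possibly the last) */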

30

Page 578: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP for directive

• for directive helps share iterations of a loop

between a group of threads

• If nowait is specified then the threads do not wait

for synchronization at the end of a parallel loop

• The schedule clause describes how iterations of

a loop are divided among the threads in the team

(discussed in detail in the next few slides)

31

#pragma omp parallel

{

p=5;

#pragma omp for

for (i=0; i<24; i++)

x[i]=y[i]+p*(i+3);

} /* omp end parallel */

(Diagram: fork; each thread sets p=5 and executes its share of the for loop iterations, e.g. i=0..4, i=5..9, ..., computing x[i]=y[i]+…; join.)

Page 579: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Simple Loop Parallelization

#pragma omp parallel for

for (i=0; i<n; i++)

z[i] = a*x[i]+y;

32

Master thread executing serial portion of the code

Master thread encounters parallel for loop and creates worker threads

Master and worker threads divide iterations of the for loop and execute them concurrently

Implicit barrier: wait for all threads to finish their executions

Master thread executing serial portion of the code resumes and slave threads are discarded

Page 580: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Example: OpenMP work sharing

Constructs

33

#include <omp.h>#define N 16main (){int i, chunk;float a[N], b[N], c[N];for (i=0; i < N; i++)a[i] = b[i] = i * 1.0;

chunk = 4;printf("a[i] + b[i] = c[i] \n");#pragma omp parallel shared(a,b,c,chunk) private(i){#pragma omp for schedule(dynamic,chunk) nowaitfor (i=0; i < N; i++)c[i] = a[i] + b[i];

} /* end of parallel section */for (i=0; i < N; i++)

printf(" %f + %f = %f \n",a[i],b[i],c[i]);}

Initializing the vectors a[i], b[i]

Instructing the runtime environment that a,b,c,chunk are shared variables and I is a private variable

Load balancing the threads using a DYNAMIC policy where array is divided into chunks of 4 and assigned to the threads

The nowait clause ensures that the child threads do not synchronize once their work is completed

Modified from examples posted on: https://computing.llnl.gov/tutorials/openMP/

Page 581: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

DEMO : Work Sharing Constructs :

Shared / Private / Schedule

• Vector addition problem to be used

• Two vectors a[i] + b[i] = c[i] a[i] + b[i] = c[i]

0.000000 + 0.000000 = 0.000000

1.000000 + 1.000000 = 2.000000

2.000000 + 2.000000 = 4.000000

3.000000 + 3.000000 = 6.000000

4.000000 + 4.000000 = 8.000000

5.000000 + 5.000000 = 10.000000

6.000000 + 6.000000 = 12.000000

7.000000 + 7.000000 = 14.000000

8.000000 + 8.000000 = 16.000000

9.000000 + 9.000000 = 18.000000

10.000000 + 10.000000 = 20.000000

11.000000 + 11.000000 = 22.000000

12.000000 + 12.000000 = 24.000000

13.000000 + 13.000000 = 26.000000

14.000000 + 14.000000 = 28.000000

15.000000 + 15.000000 = 30.000000

34

Page 582: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP sections directive
• The sections directive is a non-iterative work sharing construct.
• Independent sections of code are nested within a sections directive
• It distributes the enclosed sections of code among the group of threads
• Code enclosed within a section directive is executed by one thread from the pool of threads

35

#pragma omp parallel private(p)

{

#pragma omp sections

{{ a=…;

b=…;}

#pragma omp section

{ p=…;

q=…;}

#pragma omp section

{ x=…;

y=…;}

} /* omp end sections */

} /* omp end parallel */

(Diagram: fork; the three sections {a=…; b=…}, {p=…; q=…}, and {x=…; y=…} are executed concurrently by different threads; join.)

Page 583: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Understanding variables in OpenMP

• Shared variable z is modified by multiple threads

• Each iteration reads the scalar variables a and y

and the array element x[i]

• a,y,x can be read concurrently as their values

remain unchanged.

• Each iteration writes to a distinct element of z[i]

over the index range. Hence write operations can

be carried out concurrently with each iteration

writing to a distinct array index and memory

location

• The parallel for directive in OpenMP ensures that

the for loop index value (i in this case) is private to

each thread.

36

(Diagram: the loop index i is private to each thread, while z[ ], a, x[ ], y, and n are shared.)

#pragma omp parallel for

for (i=0; i<n; i++)

z[i] = a*x[i]+y

Page 584: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Example : OpenMP Sections

37

#include <omp.h>#define N 16main (){int i;float a[N], b[N], c[N], d[N];for (i=0; i < N; i++)

a[i] = b[i] = i * 1.5;#pragma omp parallel shared(a,b,c,d) private(i){#pragma omp sections nowait

{#pragma omp sectionfor (i=0; i < N; i++)c[i] = a[i] + b[i];

#pragma omp sectionfor (i=0; i < N; i++)d[i] = a[i] * b[i];

} /* end of sections */} /* end of parallel section */…

Section : that computes the sum of the 2 vectors

Section : that computes the product of the 2 vectors

Sections construct that encloses the section calls

Modified from examples posted on: https://computing.llnl.gov/tutorials/openMP/

Page 585: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

DEMO : OpenMP Sections

38

[LSU760000@n00 l12]$ ./sections a[i] b[i] a[i]+b[i] a[i]*b[i] 0.000000 0.000000 0.000000 0.000000 1.500000 1.500000 3.000000 2.250000 3.000000 3.000000 6.000000 9.000000 4.500000 4.500000 9.000000 20.250000 6.000000 6.000000 12.000000 36.000000 7.500000 7.500000 15.000000 56.250000 9.000000 9.000000 18.000000 81.000000 10.500000 10.500000 21.000000 110.250000 12.000000 12.000000 24.000000 144.000000 13.500000 13.500000 27.000000 182.250000 15.000000 15.000000 30.000000 225.000000 16.500000 16.500000 33.000000 272.250000 18.000000 18.000000 36.000000 324.000000 19.500000 19.500000 39.000000 380.250000 21.000000 21.000000 42.000000 441.000000 22.500000 22.500000 45.000000 506.250000

Page 586: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

39

Page 587: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Thread Synchronization

• “communication” mainly through read write operations on shared

variables

• Synchronization defines the mechanisms that help in coordinating

execution of multiple threads (that use a shared context) in a parallel

program.

• Without synchronization, multiple threads accessing shared memory

location may cause conflicts by :

– Simultaneously attempting to modify the same location

– One thread attempting to read a memory location while another thread is

updating the same location.

• Synchronization helps by providing explicit coordination between

multiple threads.

• Two main forms of synchronization :

– Implicit event synchronization

– Explicit synchronization – critical, master directives in OpenMP

40

Page 588: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Basic Types of Synchronization

• Explicit Synchronization via mutual exclusion

– Controls access to the shared variable by providing a thread exclusive

access to the memory location for the duration of its construct.

– Critical directive of OpenMP provides mutual exclusion

• Event Synchronization

– Signals occurrence of an event across multiple threads.

– Barrier directives in OpenMP provide the simplest form of event

synchronization

– The barrier directive defines a point in a parallel program where each

thread waits for all other threads to arrive. This helps to ensure that all

threads have executed the same code in parallel up to the barrier.

– Once all threads arrive at the point, the threads can continue execution

past the barrier.

• Additional synchronization mechanisms available in OpenMP
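A minimal sketch of barrier-based event synchronization (phase1() and phase2() are illustrative placeholders, not from the slides):

#pragma omp parallel
{
    phase1();              /* every thread works on phase 1 */
#pragma omp barrier        /* no thread starts phase 2 until all have finished phase 1 */
    phase2();
}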

41

Page 589: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP Synchronization : master

• The master directive in OpenMP marks a block of code that gets

executed on a single thread.

• The rest of the threads in the group ignore the portion of code

marked by the master directive

• Example

#pragma omp master structured block

42

Race Condition :

Two asynchronous threads access the same shared variable, at least one of them modifies the variable, and the sequence of operations is undefined. The result of these asynchronous operations depends on the detailed timing of the individual threads of the group.

Page 590: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP critical directive :

Explicit Synchronization

• Race conditions can be avoided by controlling access to shared variables by allowing threads to have exclusive access to the variables

• Exclusive access to shared variables allows the thread to atomically perform read, modify and update operations on the variable.

• Mutual exclusion synchronization is provided by the critical directive of OpenMP

• Code block within the critical region defined by critical /end critical directives can be executed only by one thread at a time.

• Other threads in the group must wait until the current thread exits the critical region. Thus only one thread can manipulate values in the critical region.

43

fork

join

- critical region

int x

x=0;

#pragma omp parallel shared(x)

{

#pragma omp critical

x = 2*x + 1;

} /* omp end parallel */

Page 591: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Simple Example : critical

44

cnt = 0;

f = 7;

#pragma omp parallel

{

#pragma omp for

for (i=0;i<20;i++){

if(b[i] == 0){

#pragma omp critical

cnt ++;

} /* end if */

a[i]=b[i]+f*(i+1);

} /* end for */

} /* omp end parallel */

(Diagram: with cnt=0 and f=7, the fork gives each thread a block of iterations (i=0..4, i=5..9, i=10..14, i=15..19); the cnt++ updates inside the critical region are serialized across threads, while the a[i]=b[i]+… updates proceed in parallel; join.)

Page 592: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

45

Page 593: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP : Reduction

• performs reduction on shared variables in list based on the operator provided.

• for C/C++ operator can be any one of :

– +, *, -, ^, |, ||, & or &&

– At the end of a reduction, the shared variable contains the result obtained upon

combination of the list of variables processed using the operator specified.

46

sum = 0.0

#pragma omp parallel for reduction(+:sum)

for (i=0; i < 20; i++)

sum = sum + (a[i] * b[i]);

(Diagram: sum is initialized to 0; each thread accumulates a partial sum over its block of iterations (i=0..4, i=5..9, i=10..14, i=15..19); at the end of the region the partial sums are combined (∑) into the shared sum.)

Page 594: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Example: Reduction

47

#include <omp.h>main () {int i, n, chunk;float a[16], b[16], result;n = 16;chunk = 4;result = 0.0;for (i=0; i < n; i++){

a[i] = i * 1.0;b[i] = i * 2.0;

}#pragma omp parallel for default(shared) private(i) \

schedule(static,chunk) reduction(+:result)for (i=0; i < n; i++)

result = result + (a[i] * b[i]);printf("Final result= %f\n",result);}

Reduction example with summation where the result of the reduction operation stores the dotproduct of two vectors

∑a[i]*b[i]

SRC : https://computing.llnl.gov/tutorials/openMP/

Page 595: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Demo: Dot Product using Reduction

48

[LSU760000@n00 l12]$ ./reduction a[i] b[i] a[i]*b[i]0.000000 0.000000 0.0000001.000000 2.000000 2.0000002.000000 4.000000 8.0000003.000000 6.000000 18.0000004.000000 8.000000 32.0000005.000000 10.000000 50.0000006.000000 12.000000 72.0000007.000000 14.000000 98.0000008.000000 16.000000 128.0000009.000000 18.000000 162.00000010.000000 20.000000 200.00000011.000000 22.000000 242.00000012.000000 24.000000 288.00000013.000000 26.000000 338.00000014.000000 28.000000 392.00000015.000000 30.000000 450.000000Final result= 2480.000000

Page 596: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

49

Page 597: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Synopsis of Commands

• How to invoke OpenMP runtime systems: #pragma omp parallel

• The interplay between OpenMP environment variables and the runtime system (omp_get_num_threads(), omp_get_thread_num())

• Shared data directives such as shared, private and reduction

• Basic flow control using sections, for

• Fundamentals of synchronization using the critical directive and critical sections

• And directives used for the OpenMP programming part of the problem set.

50

Page 598: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

51

Page 599: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Summary – Material for Test

• HPC Modalities – 4,5

• Performance issues in shared memory programming – 7,

8, 9, 10

• OpenMP runtime library routines – 16, 17

• OpenMP environment variables – 18, 19, 20

• OpenMP data environment 27, 28

• OpenMP work sharing directives – 29, 30, 31, 35, 36

• OpenMP thread synchronization – 40, 41, 42, 43

• OpenMP reduction 46

52

Page 600: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

53

Page 601: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &

MEANS

APPLIED PARALLEL ALGORITHMS 1

Prof. Thomas Sterling
Dr. Hartmut Kaiser
Department of Computer Science
Louisiana State University
March 10, 2011

Page 602: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Dr. Hartmut Kaiser

Center for Computation & Technology

R315 Johnston

[email protected]

2

Page 603: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Puzzle of the Day

• What’s the difference between the following valid C

function declarations:

void foo();
void foo(void);
void foo(…);

Page 604: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Puzzle of the Day

• What's the difference between the following valid C function declarations:

void foo();      any number of parameters
void foo(void);  no parameters
void foo(…);     any number of parameters

• What's the difference between the following valid C++ function declarations:

void foo();
void foo(void);
void foo(…);

Page 605: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Puzzle of the Day

• What’s the difference between the following valid C

function declarations:

void foo();      any number of parameters
void foo(void);  no parameters
void foo(…);     any number of parameters

• What’s the difference between the following valid C++ function declarations:

void foo();      no parameters
void foo(void);  no parameters
void foo(…);     any number of parameters

Page 606: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

6

Topics

• Introduction

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Vector Dot-Product

• Matrix Multiplication

Page 607: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

7

Topics

• Introduction

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Vector Dot-Product

• Matrix Multiplication

Page 608: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

8

Parallel Programming

• Goals

– Correctness

– Reduction in execution time

– Efficiency

– Scalability

– Increased problem size and richness of models

• Objectives

– Expose parallelism

• Algorithm design

– Distribute work uniformly

• Data decomposition and allocation

• Dynamic load balancing

– Minimize overhead of synchronization and communication

• Coarse granularity

• Big messages

– Minimize redundant work

• Still sometimes better than communication

Page 609: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

9

Basic Parallel (MPI) Program Steps

• Establish logical bindings

• Initialize application execution environment

• Distribute data and work

• Perform core computations in parallel (across nodes)

• Synchronize and exchange intermediate data results
– Optional for non-embarrassingly parallel (cooperative) computations

• Detect "stop" condition
– Maybe implicit with a barrier, etc.

• Aggregate final results
– Often a reduction operator

• Output results and error code

• Terminate and return to OS
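
As a concrete illustration of these steps, here is a minimal MPI skeleton (a sketch added for this write-up, not part of the original lecture code); the array size and the equal-slice work split are placeholder choices for the example:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  /* Initialize application execution environment */
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* establish logical bindings */
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* Distribute data and work: each rank takes an equal slice (placeholder scheme) */
  const int n = 1000000;
  int chunk = n / size;
  int start = rank * chunk, end = start + chunk;

  /* Perform core computations in parallel (across nodes) */
  double local = 0.0;
  for (int i = start; i < end; i++)
    local += (double)i;

  /* Aggregate final results: a reduction operator */
  double total = 0.0;
  MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  /* Output results, then terminate and return to the OS */
  if (rank == 0)
    printf("total = %f\n", total);
  MPI_Finalize();
  return 0;
}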

Page 610: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

10

“embarrassingly parallel”

• Common phrase

– poorly defined,

– widely used

• Suggests lots and lots of parallelism

– with essentially no inter task communication or coordination

– Highly partitionable workload with minimal overhead

• “almost embarrassingly parallel”

– Same as above, but

– Requires master to launch many tasks

– Requires master to collect final results of tasks

– Sometimes still referred to as “embarrassingly parallel”

Page 611: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

11

Topics

• Introduction

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Vector Dot-Product

• Matrix Multiplication

Page 612: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Mandelbrot set

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B.

Wilkinson & M. Allen,

@ 2004 Pearson Education Inc. All rights reserved.

12

Page 613: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson

& M. Allen,

@ 2004 Pearson Education Inc. All rights reserved.

Mandelbrot Set

Set of points in a complex plane that are quasi-stable (will increase and decrease, but not exceed some limit) when computed by iterating the function

z_{k+1} = z_k^2 + c

where z_{k+1} is the (k + 1)th iteration of the complex number z = (a + bi) and c is a complex number giving the position of the point in the complex plane. The initial value for z is zero.

Iterations are continued until the magnitude of z is greater than 2 or the number of iterations reaches an arbitrary limit. The magnitude of z is the length of the vector, given by

|z| = sqrt(a^2 + b^2)

13

Page 614: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,

@ 2004 Pearson Education Inc. All rights reserved.

Sequential routine computing value of

one point returning number of iterations

struct complex {

float real;

float imag;

};

int cal_pixel(complex c)

{

int count, max;

complex z;

float temp, lengthsq;

max = 256;

z.real = 0; z.imag = 0;

count = 0; /* number of iterations */

do {

temp = z.real * z.real - z.imag * z.imag + c.real;

z.imag = 2 * z.real * z.imag + c.imag;

z.real = temp;

lengthsq = z.real * z.real + z.imag * z.imag;

count++;

} while ((lengthsq < 4.0) && (count < max));

return count;

}

14

Page 615: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Parallelizing Mandelbrot Set Computation

Static Task Assignment

Simply divide the region into a fixed number of parts, each computed by a separate processor.

Not very successful, because different regions require different numbers of iterations and time.

Dynamic Task Assignment

Have processors request new regions after computing their previous regions.

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,

@ 2004 Pearson Education Inc. All rights reserved.

15

Page 616: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,

@ 2004 Pearson Education Inc. All rights reserved.

Dynamic Task Assignment: Work Pool / Processor Farms

16
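
The work-pool approach shown in the figure above can be implemented, for example, with the master handing out one row index at a time and giving a worker a new row as soon as it returns a result. The sketch below is an illustration added here, not the lecture's code; the message tags, the buffer layout, and the row-width bound are assumptions made for the example:

#include <mpi.h>

#define TAG_WORK 1
#define TAG_STOP 2
#define MAX_NY   1024   /* assumed upper bound on row width for this sketch */

void master(int nrows, int nworkers, int ny)
{
  double row_buf[MAX_NY + 1];   /* [0] = row index, [1..ny] = computed pixel values */
  int next_row = 0, active = 0;
  MPI_Status st;

  /* prime every worker with one row */
  for (int w = 1; w <= nworkers && next_row < nrows; w++) {
    MPI_Send(&next_row, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
    next_row++; active++;
  }
  while (active > 0) {
    MPI_Recv(row_buf, ny + 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &st);
    active--;
    /* ... store row_buf[1..ny] at row (int)row_buf[0] ... */
    if (next_row < nrows) {   /* more work: immediately hand out another row */
      MPI_Send(&next_row, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
      next_row++; active++;
    } else {                  /* no more work: tell this worker to stop */
      int stop = -1;
      MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
    }
  }
  /* the matching worker loop (receive row index, compute, send back) is omitted */
}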

Page 617: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

17

Flowchart for Mandelbrot Set Generation ("master" and "workers")

[Flowchart: the master and each worker initialize the MPI environment, create a local workload buffer, isolate their work regions, and calculate Mandelbrot set values across their regions. The workers send their results to the master; the master writes its own result (task 0) to a file, receives the results from the workers, concatenates them to the file, and ends.]

Page 618: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

18

Mandelbrot Sets (source code)

#include <stdio.h>

#include<assert.h>

#include<stdlib.h>

#include<mpi.h>

typedef struct complex{

double real;

double imag;

} Complex;

int cal_pixel(Complex c){

int count, max_iter;

Complex z;

double temp, lengthsq;

max_iter = 256;

z.real = 0;

z.imag = 0;

count = 0;

do{

temp = z.real * z.real - z.imag * z.imag + c.real;

z.imag = 2 * z.real * z.imag + c.imag;

z.real = temp;

lengthsq = z.real * z.real + z.imag * z.imag;

count ++;

}

while ((lengthsq < 4.0) && (count < max_iter));

return(count);

} Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

cal_pixel() runs on every worker process and calculates the iteration count of z_{k+1} = z_k^2 + c for every pixel

Page 619: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

19

Mandelbrot Sets (source code)

#define MASTERPE 0

int main(int argc, char **argv)
{
  FILE *file;
  int i, j;
  int tmp;
  Complex c;
  double *data_l, *data_l_tmp;
  int nx, ny;
  int mystrt, myend;
  int nrows_l;
  int nprocs, mype;
  MPI_Status status;

  /***** Initializing MPI Environment *****/
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &mype);

  /***** Pass in the dimension (X,Y) of the area to cover *****/
  if (argc != 3){
    int err = 0;
    printf("argc %d\n", argc);
    if (mype == MASTERPE){
      printf("usage: mandelbrot nx ny");
      MPI_Abort(MPI_COMM_WORLD, err);
    }
  }
  /* get command line args */
  nx = atoi(argv[1]);
  ny = atoi(argv[2]);

Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

Initialize MPI Environment

Check if the input arguments : x,y dimensions of the region to be processed are passed

Page 620: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

20

Mandelbrot Sets (source code)

  /* assume divides equally */
  nrows_l = nx/nprocs;
  mystrt = mype*nrows_l;
  myend = mystrt + nrows_l - 1;

  /* create buffer for local work only */
  data_l = (double *) malloc(nrows_l * ny * sizeof(double));
  data_l_tmp = data_l;

  /* calc each procs coordinates and call local mandelbrot value generation function */
  for (i = mystrt; i <= myend; ++i){
    c.real = i/((double) nx) * 4. - 2.;
    for (j = 0; j < ny; ++j){
      c.imag = j/((double) ny) * 4. - 2.;
      tmp = cal_pixel(c);
      *data_l++ = (double) tmp;
    }
  }
  data_l = data_l_tmp;

Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

Determining the dimensions of the work to be performed by each concurrent task.

Local tasks calculate the coordinates for each pixel in the local region.For each pixel, cal_pixel() function is called and the corresponding value is calculated

Page 621: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

21

Mandelbrot Sets (source code)

  if (mype == MASTERPE){
    file = fopen("mandelbrot.bin_0000", "w");
    printf("nrows_l, ny %d %d\n", nrows_l, ny);
    fwrite(data_l, nrows_l*ny, sizeof(double), file);
    fclose(file);
    for (i = 1; i < nprocs; ++i){
      MPI_Recv(data_l, nrows_l * ny, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
      printf("received message from proc %d\n", i);
      file = fopen("mandelbrot.bin_0000", "a");
      fwrite(data_l, nrows_l*ny, sizeof(double), file);
      fclose(file);
    }
  }
  else {
    MPI_Send(data_l, nrows_l * ny, MPI_DOUBLE, MASTERPE, 0, MPI_COMM_WORLD);
  }
  MPI_Finalize();
}

Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

Master process opens a file to store output into and stores its values in the file

Master then waits to receive values computed by each of the worker processes

Worker processes send computed mandelbrot values of their region to the master process

Page 622: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

22

Demo : Mandelbrot Sets

Page 623: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Demo: Mandelbrot Sets

23

Page 624: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

24

Topics

• Introduction

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Vector Dot-Product

• Matrix Multiplication

Page 625: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

25

Page 626: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Monte Carlo Simulation

• Used when it is infeasible or impossible to compute an exact result with a deterministic algorithm

• Especially useful in

– Studying systems with a large number of coupled degrees of freedom
• Fluids, disordered materials, strongly coupled solids, cellular structures

– Modeling phenomena with significant uncertainty in inputs
• The calculation of risk in business

– These methods are also widely used in mathematics
• The evaluation of definite integrals, particularly multidimensional integrals with complicated boundary conditions

26

Page 627: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Monte Carlo Simulation

• No single approach, multitude of different methods

• Usually follows pattern

– Define a domain of possible inputs

– Generate inputs randomly from the domain

– Perform a deterministic computation using the inputs

– Aggregate the results of the individual computations into the final result

• Example: calculate Pi

27

Page 628: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

28

Monte Carlo: Algorithm for Pi

• The value of PI can be calculated in a number of ways. Consider the following method of approximating PI: inscribe a circle in a square

• Randomly generate points in the square

• Determine the number of points in the square that are also in the circle

• Let r be the number of points in the circle divided by the number of points in the square

• PI ~ 4 r

• Note that the more points generated, the better the approximation

• Algorithm :

npoints = 10000

circle_count = 0

do j = 1,npoints

generate 2 random numbers between 0 and 1

xcoordinate = random1 ; ycoordinate = random2

if (xcoordinate, ycoordinate) inside circle

then circle_count = circle_count + 1

end do

PI = 4.0*circle_count/npoints
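
A minimal serial C version of the pseudocode above (a sketch added here, not from the original slides; it assumes rand() from <stdlib.h> as the random number source):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const int npoints = 10000;
  int circle_count = 0;

  srand(42);                                /* arbitrary seed */
  for (int j = 0; j < npoints; j++) {
    double x = (double)rand() / RAND_MAX;   /* random point in the unit square */
    double y = (double)rand() / RAND_MAX;
    if (x*x + y*y <= 1.0)                   /* point falls inside the inscribed circle */
      circle_count++;
  }
  printf("PI ~ %f\n", 4.0 * circle_count / npoints);
  return 0;
}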

Page 629: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

29

Page 630: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

30

OpenMP Pi Calculation

[Flowchart: the master thread initializes variables and the OpenMP parallel environment; the master and N worker threads each repeatedly generate random X,Y points, calculate Z = X^2 + Y^2, and increment their count when the point lies within the circle; the per-thread counts are combined with a reduction (∑), and the master calculates and prints the value of pi.]

Page 631: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

OpenMP Calculating Pi

31

#include <omp.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#define SEED 42

main(int argc, char* argv[])
{
  int niter=0;
  double x,y;
  int i,tid,count=0;   /* # of points in the 1st quadrant of unit circle */
  double z;
  double pi;
  time_t rawtime;
  struct tm * timeinfo;

  printf("Enter the number of iterations used to estimate pi: ");
  scanf("%d",&niter);
  time ( &rawtime );
  timeinfo = localtime ( &rawtime );

Seed for generating random number

http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML

Page 632: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

OpenMP Calculating Pi

32

printf ( "The current date/time is: %s", asctime (timeinfo) );/* initialize random numbers */srand(SEED);

#pragma omp parallel for private(x,y,z,tid) reduction(+:count)for ( i=0; i<niter; i++) {

x = (double)rand()/RAND_MAX;y = (double)rand()/RAND_MAX;z = (x*x+y*y);if (z<=1) count++;if (i==(niter/6)-1) {

tid = omp_get_thread_num();printf(" thread %i just did iteration %i the count is %i\n",tid,i,count);

}if (i==(niter/3)-1) {

tid = omp_get_thread_num();printf(" thread %i just did iteration %i the count is %i\n",tid,i,count);

}if (i==(niter/2)-1) {

tid = omp_get_thread_num();printf(" thread %i just did iteration %i the count is %i\n",tid,i,count);

} http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML

Initialize random number generator; srand is used to seed the random number generated by rand()

Randomly generate x,y points

Initialize OpenMP parallel for with reduction(∑)

Calculate x^2+y^2 and check if it lies within the circle; if yes then increment count

Page 633: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Calculating Pi

33

    if (i==(2*niter/3)-1) {
      tid = omp_get_thread_num();
      printf(" thread %i just did iteration %i the count is %i\n",tid,i,count);
    }
    if (i==(5*niter/6)-1) {
      tid = omp_get_thread_num();
      printf(" thread %i just did iteration %i the count is %i\n",tid,i,count);
    }
    if (i==niter-1) {
      tid = omp_get_thread_num();
      printf(" thread %i just did iteration %i the count is %i\n",tid,i,count);
    }
  }
  time ( &rawtime );
  timeinfo = localtime ( &rawtime );
  printf ( "The current date/time is: %s", asctime (timeinfo) );
  printf(" the total count is %i\n",count);
  pi=(double)count/niter*4;
  printf("# of trials= %d , estimate of pi is %g \n",niter,pi);
  return 0;
}

http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML

Calculate PI based on the aggregate count of the points that lie within the circle

Page 634: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Demo : OpenMP Pi

34

[cdekate@celeritas l13]$ ./omcpi
Enter the number of iterations used to estimate pi: 100000
The current date/time is: Tue Mar 4 05:53:52 2008
thread 0 just did iteration 16665 the count is 13124
thread 1 just did iteration 33332 the count is 6514
thread 1 just did iteration 49999 the count is 19609
thread 2 just did iteration 66665 the count is 13048
thread 3 just did iteration 83332 the count is 6445
thread 3 just did iteration 99999 the count is 19489
The current date/time is: Tue Mar 4 05:53:52 2008
the total count is 78320
# of trials= 100000 , estimate of pi is 3.1328
[cdekate@celeritas l13]$

Page 635: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

35

Creating Custom Communicators

• Communicators define groups and the access patterns among them

• Default communicator is MPI_COMM_WORLD

• Some algorithms demand more sophisticated control of communications to take advantage of reduction operators

• MPI permits creation of custom communicators

• MPI_Comm_create
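
For example, a communicator containing every rank except a designated server can be built from the group of MPI_COMM_WORLD. The fragment below is a sketch of the calls used in the Monte Carlo example later in this lecture; server_rank is a placeholder name:

MPI_Group world_group, worker_group;
MPI_Comm  workers;
int exclude[1];

MPI_Comm_group(MPI_COMM_WORLD, &world_group);        /* group of all ranks */
exclude[0] = server_rank;                            /* rank to leave out (the server) */
MPI_Group_excl(world_group, 1, exclude, &worker_group);
MPI_Comm_create(MPI_COMM_WORLD, worker_group, &workers);
MPI_Group_free(&worker_group);
/* collectives such as MPI_Allreduce can now run over "workers" only */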

Page 636: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

36

MPI Monte Carlo Pi Computation

[Flowchart: the server process initializes MPI, then repeatedly receives a request, computes a random array, and sends it to the requestor until the last request arrives, after which it finalizes MPI. The master and the workers initialize MPI; the master broadcasts the error bound and the workers receive it; each then repeatedly sends a request to the server, receives a random array, performs its computations, and propagates the number of points with an Allreduce until the stop condition is satisfied. The master outputs partial results and prints statistics; all processes finalize MPI.]

Page 637: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

37

Monte Carlo : MPI - Pi (source code)

#include <stdio.h>
#include <math.h>
#include "mpi.h"
#define CHUNKSIZE 1000
#define INT_MAX 1000000000
#define REQUEST 1
#define REPLY 2

int main( int argc, char *argv[] )
{
  int iter;
  int in, out, i, iters, max, ix, iy, ranks[1], done, temp;
  double x, y, Pi, error, epsilon;
  int numprocs, myid, server, totalin, totalout, workerid;
  int rands[CHUNKSIZE], request;
  MPI_Comm world, workers;
  MPI_Group world_group, worker_group;
  MPI_Status status;

  MPI_Init(&argc,&argv);
  world = MPI_COMM_WORLD;
  MPI_Comm_size(world,&numprocs);
  MPI_Comm_rank(world,&myid);

Initialize MPI environment

Page 638: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

38

Monte Carlo : MPI - Pi (source code)

  server = numprocs-1;   /* last proc is server */
  if (myid == 0)
    sscanf( argv[1], "%lf", &epsilon );

  MPI_Bcast( &epsilon, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD );
  MPI_Comm_group( world, &world_group );
  ranks[0] = server;
  MPI_Group_excl( world_group, 1, ranks, &worker_group );
  MPI_Comm_create( world, worker_group, &workers );
  MPI_Group_free(&worker_group);

  if (myid == server) {
    do {
      MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, REQUEST, world, &status);
      if (request) {
        for (i = 0; i < CHUNKSIZE; ) {
          rands[i] = random();
          if (rands[i] <= INT_MAX) i++;
        }
        /* Send random number array */
        MPI_Send(rands, CHUNKSIZE, MPI_INT, status.MPI_SOURCE, REPLY, world);
      }
    } while( request>0 );
  }
  else {   /* Begin Worker Block */
    request = 1;
    done = in = out = 0;
    max = INT_MAX;   /* max int, for normalization */
    MPI_Send( &request, 1, MPI_INT, server, REQUEST, world );
    MPI_Comm_rank( workers, &workerid );
    iter = 0;

Broadcast Error Bounds: epsilon

Create a custom communicator

Server process : 1. Receives a request to generate a random number array, 2. Computes the random number array, 3. Sends the array to the requestor

Worker process : Request the server to generate a random number array

Page 639: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

39

Monte Carlo : MPI - Pi (source code)

    while (!done) {
      iter++;
      request = 1;
      /* Recv. random array from server */
      MPI_Recv( rands, CHUNKSIZE, MPI_INT, server, REPLY, world, &status );
      for (i=0; i<CHUNKSIZE-1; ) {
        x = (((double) rands[i++])/max) * 2 - 1;
        y = (((double) rands[i++])/max) * 2 - 1;
        if (x*x + y*y < 1.0) in++;
        else out++;
      }
      MPI_Allreduce(&in, &totalin, 1, MPI_INT, MPI_SUM, workers);
      MPI_Allreduce(&out, &totalout, 1, MPI_INT, MPI_SUM, workers);
      Pi = (4.0*totalin)/(totalin + totalout);
      error = fabs( Pi-3.141592653589793238462643);
      done = (error < epsilon || (totalin+totalout) > 1000000);
      request = (done) ? 0 : 1;
      if (myid == 0) {   /* If "Master" : Print current value of PI */
        printf( "\rpi = %23.20f", Pi );
        MPI_Send( &request, 1, MPI_INT, server, REQUEST, world );
      }
      else {   /* If "Worker" : Request new array if not finished */
        if (request)
          MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
      }
    }
    MPI_Comm_free(&workers);
  }

Worker : Receive random number array from the Server

Worker: For each pair of x,y in the random number array, calculate the coordinates

Determine if the number is inside or out of the circle

Print current value of PI and request for more work

Compute the value of pi and check whether the error is within the threshold

Page 640: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

40

Monte Carlo : MPI - Pi (source code)

  if (myid == 0) {   /* If "Master" : Print Results */
    printf( "\npoints: %d\nin: %d, out: %d, <ret> to exit\n",
            totalin+totalout, totalin, totalout );
    getchar();
  }
  MPI_Finalize();
}

Print the final value of PI

Page 641: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

41

Demo : MPI Monte Carlo, Pi

> mpirun -np 4 monte 1e-20
pi = 3.14164517741129456496
points: 1000500
in: 785804, out: 214696

Page 642: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

42

Topics

• Introduction

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Vector Dot-Product

• Matrix Multiplication

Page 643: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Vector Dot Product

• Multiplication of 2 vectors followed by Summation

43

A[i] = (X1, X2, X3, X4, X5, ..., Xn)
B[i] = (Y1, Y2, Y3, Y4, Y5, ..., Yn)

A · B = ∑ (i = 1..n) A[i] * B[i] = X1*Y1 + X2*Y2 + X3*Y3 + X4*Y4 + X5*Y5 + ... + Xn*Yn

Page 644: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

44

OpenMP Dot Product : using Reduction

[Flowchart: the master thread initializes variables and the OpenMP parallel environment; the master and N worker threads each calculate their local computations, which are combined by a reduction (∑); the master prints the value of the dot product. The workload and schedule are determined by OpenMP at runtime.]

Page 645: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

OpenMP Dot Product

45

#include <omp.h>
#include <stdio.h>   /* added: required for printf */

main () {
  int i, n, chunk;
  float a[16], b[16], result;
  n = 16;
  chunk = 4;
  result = 0.0;
  for (i=0; i < n; i++) {
    a[i] = i * 1.0;
    b[i] = i * 2.0;
  }
  #pragma omp parallel for default(shared) private(i) \
      schedule(static,chunk) reduction(+:result)
  for (i=0; i < n; i++)
    result = result + (a[i] * b[i]);
  printf("Final result= %f\n", result);
}

Reduction example with summation, where the result of the reduction operation stores the dot product of two vectors

∑ a[i]*b[i]

SRC : https://computing.llnl.gov/tutorials/openMP/

Page 646: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Demo: Dot Product using Reduction

46

[cdekate@celeritas l12]$ ./reduction
 a[i]      b[i]       a[i]*b[i]
 0.000000  0.000000   0.000000
 1.000000  2.000000   2.000000
 2.000000  4.000000   8.000000
 3.000000  6.000000   18.000000
 4.000000  8.000000   32.000000
 5.000000  10.000000  50.000000
 6.000000  12.000000  72.000000
 7.000000  14.000000  98.000000
 8.000000  16.000000  128.000000
 9.000000  18.000000  162.000000
10.000000  20.000000  200.000000
11.000000  22.000000  242.000000
12.000000  24.000000  288.000000
13.000000  26.000000  338.000000
14.000000  28.000000  392.000000
15.000000  30.000000  450.000000
Final result= 2480.000000
[cdekate@celeritas l12]$

Page 647: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

47

MPI Dot Product Computation

[Flowchart: the master initializes variables and the MPI environment, broadcasts the size of the vectors, reads vector A and distributes its partitions, then reads vector B and distributes its partitions; each worker initializes variables and the MPI environment, receives the size of the vectors, and receives its local workloads for vectors A and B. The master and workers calculate the dot product for their local workloads, the partial results are combined by a reduction (∑), and the master prints the result.]

Page 648: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

MPI Dot Product

48

#include <stdio.h>#include "mpi.h"#define MAX_LOCAL_ORDER 100main(int argc, char* argv[]) {

float local_x[MAX_LOCAL_ORDER];float local_y[MAX_LOCAL_ORDER];int n;int n_bar; /* = n/p */float dot;int p;int my_rank;void Read_vector(char* prompt, float local_v[], int n_bar, int p,

int my_rank);float Parallel_dot(float local_x[], float local_y[], int n_bar);

MPI_Init(&argc, &argv);MPI_Comm_size(MPI_COMM_WORLD, &p);MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);if (my_rank == 0) {

printf("Enter the order of the vectors\n");scanf("%d", &n);

}

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

Initialize MPI Environment

Broadcast the order of vectors across the workers

Parallel Programming with MPI by Peter Pacheco

Page 649: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

MPI Dot Product

49

n_bar = n/p;

Read_vector("the first vector", local_x, n_bar, p, my_rank);Read_vector("the second vector", local_y, n_bar, p, my_rank);

dot = Parallel_dot(local_x, local_y, n_bar);

if (my_rank == 0)printf("The dot product is %f\n", dot);

MPI_Finalize();} /* main */

void Read_vector(char* prompt /* in */,float local_v[] /* out */,int n_bar /* in */,int p /* in */,int my_rank /* in */) {

int i, q;

Receive and distribute the two vectors

Calculate the parallel dot product for local workloads

Master: Print the result of the dot product

Parallel Programming with MPI by Peter Pacheco

Page 650: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

MPI Dot Product

50

  float temp[MAX_LOCAL_ORDER];
  MPI_Status status;

  if (my_rank == 0) {
    printf("Enter %s\n", prompt);
    for (i = 0; i < n_bar; i++)
      scanf("%f", &local_v[i]);
    for (q = 1; q < p; q++) {
      for (i = 0; i < n_bar; i++)
        scanf("%f", &temp[i]);
      MPI_Send(temp, n_bar, MPI_FLOAT, q, 0, MPI_COMM_WORLD);
    }
  } else {
    MPI_Recv(local_v, n_bar, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
  }
}  /* Read_vector */

float Serial_dot(float x[] /* in */,

MASTER: Get the input from the user and prepare the local workload

Get the input from the user, balance the load by storing the work chunks in an array and sending the array to the worker nodes for processing

Worker : Receive the local workload to be processed

Serial_dot() : calculates the dot product on local arrays

Parallel Programming with MPI by Peter Pacheco

Page 651: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

MPI Dot Product

51

                 float y[] /* in */,
                 int n     /* in */) {
  int i;
  float sum = 0.0;
  for (i = 0; i < n; i++)
    sum = sum + x[i]*y[i];
  return sum;
}  /* Serial_dot */

float Parallel_dot(float local_x[] /* in */,
                   float local_y[] /* in */,
                   int n_bar       /* in */) {
  float local_dot;
  float dot = 0.0;

  local_dot = Serial_dot(local_x, local_y, n_bar);
  MPI_Reduce(&local_dot, &dot, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
  return dot;
}  /* Parallel_dot */

Serial_dot() : calculates the dot product on local arrays

Parallel_dot() : Calls the Serial_dot() to perform the dot product for local workload

Calculate the dot product and compute the summation using the collective MPI_Reduce call (MPI_SUM)

Parallel Programming with MPI by Peter Pacheco

Page 652: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Demo: MPI Dot Product

52

[cdekate@celeritas l13]$ mpirun …. ./mpi_dot
Enter the order of the vectors
16
Enter the first vector
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Enter the second vector
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
The dot product is 2480.000000
[cdekate@celeritas l13]$

Page 653: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

53

Topics

• Introduction

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Vector Dot-Product

• Matrix Multiplication

Page 654: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

54

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.

Matrix Vector Multiplication

Page 655: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

55

Matrix-Vector Multiplication: c = A x b

Page 656: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

56

Implementing Matrix Multiplication

Sequential Code

Assume throughout that the matrices are square (n x n matrices). The sequential code to compute A x B could simply be

for (i = 0; i < n; i++)
  for (j = 0; j < n; j++) {
    c[i][j] = 0;
    for (k = 0; k < n; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
  }

This algorithm requires n^3 multiplications and n^3 additions, leading to a sequential time complexity of O(n^3). Very easy to parallelize.

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B.

Wilkinson & M. Allen,

@ 2004 Pearson Education Inc. All rights reserved.

Page 657: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Implementing Matrix Multiplication

• With n processors (and n x n matrices), we can obtain:

• Time complexity of O(n^2) with n processors
– Each instance of the inner loop is independent and can be done by a separate processor

• Time complexity of O(n) with n^2 processors
– One element of A and B assigned to each processor.
– Cost optimal since O(n^3) = n x O(n^2) = n^2 x O(n).

• Time complexity of O(log n) with n^3 processors
– By parallelizing the inner loop.
– Not cost-optimal since O(n^3) < n^3 x O(log n).

• O(log n) is a lower bound for parallel matrix multiplication.

57

Page 658: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

58

Block Matrix Multiplication

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B.

Wilkinson & M. Allen,

@ 2004 Pearson Education Inc. All rights reserved.

Partitioning into sub-matrices

Page 659: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

59

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B.

Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.

Matrix Multiplication

Page 660: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

60

Performance Improvement

Using a tree construction, n numbers can be added in O(log n) steps (using n^3 processors):

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen, @ 2004 Pearson Education Inc. All rights reserved.
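
As an illustration of the tree idea (a sketch added here, not the lecture's code), pairwise summation halves the number of partial sums at every level, so n values are combined in about log2(n) steps; on a parallel machine the additions within one level are independent:

#include <stdio.h>

/* Pairwise (tree) summation: one loop iteration per tree level. */
double tree_sum(double *x, int n)
{
  for (int stride = 1; stride < n; stride *= 2)     /* log2(n) levels */
    for (int i = 0; i + stride < n; i += 2 * stride)
      x[i] += x[i + stride];                        /* independent additions at this level */
  return x[0];
}

int main(void)
{
  double v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
  printf("sum = %f\n", tree_sum(v, 8));             /* prints 36.000000 */
  return 0;
}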

Page 661: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

61

OpenMP: Flowchart for Matrix Multiplication

[Flowchart: initialize variables and matrices, initialize the OpenMP environment, have each thread compute the matrix product for its local workload, and print the results. The schedule and workload chunk size are determined from user preferences at compile/run time; since each thread works on its own portion of the array and updates different parts of the same array, no synchronization is needed.]

Page 662: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

OpenMP Matrix Multiplication

62

#include <stdio.h>
#include <omp.h>

/* Main Program */
main()
{
  int NoofRows_A, NoofCols_A, NoofRows_B, NoofCols_B, i, j, k;
  NoofRows_A = NoofCols_A = NoofRows_B = NoofCols_B = 4;
  float Matrix_A[NoofRows_A][NoofCols_A];
  float Matrix_B[NoofRows_B][NoofCols_B];
  float Result[NoofRows_A][NoofCols_B];

  /* Matrix_A Elements */
  for (i = 0; i < NoofRows_A; i++) {
    for (j = 0; j < NoofCols_A; j++)
      Matrix_A[i][j] = i + j;
  }
  /* Matrix_B Elements */
  for (i = 0; i < NoofRows_B; i++) {
    for (j = 0; j < NoofCols_B; j++)
      Matrix_B[i][j] = i + j;
  }
  printf("The Matrix_A Is \n");

Initialize the two Matrices A[][] & B[][] with sum of their index values

SRC : https://computing.llnl.gov/tutorials/openMP/

Page 663: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

OpenMP Matrix Multiplication

63

  for (i = 0; i < NoofRows_A; i++) {
    for (j = 0; j < NoofCols_A; j++)
      printf("%f \t", Matrix_A[i][j]);
    printf("\n");
  }
  printf("The Matrix_B Is \n");
  for (i = 0; i < NoofRows_B; i++) {
    for (j = 0; j < NoofCols_B; j++)
      printf("%f \t", Matrix_B[i][j]);
    printf("\n");
  }

  for (i = 0; i < NoofRows_A; i++) {
    for (j = 0; j < NoofCols_B; j++) {
      Result[i][j] = 0.0;
    }
  }

  #pragma omp parallel for private(j,k)
  for (i = 0; i < NoofRows_A; i = i + 1)
    for (j = 0; j < NoofCols_B; j = j + 1)
      for (k = 0; k < NoofCols_A; k = k + 1)
        Result[i][j] = Result[i][j] + Matrix_A[i][k] * Matrix_B[k][j];

  printf("\nThe Matrix Computation Result Is \n");

Initialize the results matrix with 0.0

Print the Matrices for debugging purposes

Using the OpenMP parallel for directive, calculate the product of the two matrices. Load balancing is based on the values of the OpenMP environment variables and the number of threads

SRC : https://computing.llnl.gov/tutorials/openMP/

Page 664: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

OpenMP Matrix Multiplication

64

  for (i = 0; i < NoofRows_A; i = i + 1) {
    for (j = 0; j < NoofCols_B; j = j + 1)
      printf("%f ", Result[i][j]);
    printf("\n");
  }
}

SRC : https://computing.llnl.gov/tutorials/openMP/

Page 665: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

DEMO : OpenMP Matrix Multiplication

65

[cdekate@celeritas l13]$ ./omp_mm
The Matrix_A Is
0.000000 1.000000 2.000000 3.000000
1.000000 2.000000 3.000000 4.000000
2.000000 3.000000 4.000000 5.000000
3.000000 4.000000 5.000000 6.000000
The Matrix_B Is
0.000000 1.000000 2.000000 3.000000
1.000000 2.000000 3.000000 4.000000
2.000000 3.000000 4.000000 5.000000
3.000000 4.000000 5.000000 6.000000

The Matrix Computation Result Is
14.000000 20.000000 26.000000 32.000000
20.000000 30.000000 40.000000 50.000000
26.000000 40.000000 54.000000 68.000000
32.000000 50.000000 68.000000 86.000000
[cdekate@celeritas l13]$

Page 666: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

66

Flowchart for MPI Matrix Multiplication ("master" and "workers")

[Flowchart: the master and the workers each initialize the MPI environment; the master initializes the array, partitions it into workloads, and sends a workload to each worker; the workers receive their work, calculate their portion of the matrix product, and send the result back; the master waits for the workers to finish, receives the results, prints them, and ends.]

Page 667: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

67

Matrix Multiplication (source code)#include "mpi.h"#include <stdio.h>#include <stdlib.h>#define NRA 4 /* number of rows in matrix A */#define NCA 4 /* number of columns in matrix A */#define NCB 4 /* number of columns in matrix B */#define MASTER 0 /* taskid of first task */#define FROM_MASTER 1 /* setting a message type */#define FROM_WORKER 2 /* setting a message type */int main(argc,argv)int argc;char *argv[];{int numtasks, /* number of tasks in partition */

taskid, /* a task identifier */numworkers, /* number of worker tasks */source, /* task id of message source */dest, /* task id of message destination */mtype, /* message type */rows, /* rows of matrix A sent to each worker */averow, extra, offset, /* used to determine rows sent to each worker */i, j, k, rc; /* misc */

double a[NRA][NCA], /* matrix A to be multiplied */b[NCA][NCB], /* matrix B to be multiplied */c[NRA][NCB]; /* result matrix C */

MPI_Status status;

MPI_Init(&argc,&argv);MPI_Comm_rank(MPI_COMM_WORLD,&taskid);MPI_Comm_size(MPI_COMM_WORLD,&numtasks);

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

Initialize the MPI environment


Page 668: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

68

Matrix Multiplication (source code)

if (numtasks < 2 ) {
  printf("Need at least two MPI tasks. Quitting...\n");
  MPI_Abort(MPI_COMM_WORLD, rc);
  exit(1);
}
numworkers = numtasks-1;

if (taskid == MASTER)
{
  for (i=0; i<NRA; i++)
    for (j=0; j<NCA; j++){
      a[i][j]= i+j+1;
      b[i][j]= i+j+1;
    }
  printf("Matrix A :: \n");
  for (i=0; i<NRA; i++){
    printf("\n");
    for (j=0; j<NCB; j++)
      printf("%6.2f ", a[i][j]);
  }
  printf("Matrix B :: \n");
  for (i=0; i<NRA; i++) {
    printf("\n");
    for (j=0; j<NCB; j++)
      printf("%6.2f ", b[i][j]);
  }
  averow = NRA/numworkers;
  extra = NRA%numworkers;
  offset = 0;
  mtype = FROM_MASTER;

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

MASTER: Initialize the matrix A & B

Print the two matrices for Debugging purposes

Calculate the number of rows to be processed by each worker

Calculate the number of overflow rows to be processed additionally by each worker

Page 669: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

69

Matrix Multiplication (source code)

  for (dest=1; dest<=numworkers; dest++) {
    /* To each worker send: start point, number of rows to process, and sub-arrays to process */
    rows = (dest <= extra) ? averow+1 : averow;
    printf("Sending %d rows to task %d offset=%d\n",rows,dest,offset);
    MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
    MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
    MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
    MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
    offset = offset + rows;
  }

  /* Receive results from worker tasks */
  mtype = FROM_WORKER;   /* Message tag for messages sent by "workers" */
  for (i=1; i<=numworkers; i++){
    source = i;
    /* offset stores the (processing) starting point of the work chunk */
    MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
    MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
    MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
    printf("Received results from task %d\n",source);
  }
  printf("******************************************************\n");
  printf("Result Matrix:\n");
  for (i=0; i<NRA; i++){
    printf("\n");
    for (j=0; j<NCB; j++)
      printf("%6.2f ", c[i][j]);
  }
  printf("\n******************************************************\n");
  printf ("Done.\n");
}

MASTER: Send the workload chunk across to each of the workers

MASTER: Receive the workload chunks back from the workers; c[][] contains the matrix products calculated for each workload chunk by the corresponding worker

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

Page 670: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

70

Matrix Multiplication (source code)

/**************************** worker task ************************************/
if (taskid > MASTER)
{
  mtype = FROM_MASTER;
  MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
  MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
  MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
  MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

  for (k=0; k<NCB; k++)
    for (i=0; i<rows; i++){
      c[i][k] = 0.0;
      for (j=0; j<NCA; j++)
        /* Calculate the product and store result in C */
        c[i][k] = c[i][k] + a[i][j] * b[j][k];
    }
  mtype = FROM_WORKER;
  MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
  MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
  /* Worker sends the resultant array to the master */
  MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
}
MPI_Finalize();
}

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

WORKER: Receive the workload to be processed by each worker

Calculate the matrix product and store the result in c[][]

Send the computed results array to the Master


Page 671: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

71

Demo : Matrix Multiplication

[cdekate@celeritas matrix_multiplication]$ mpirun -np 4 -machinefile ~/hosts ./mpi_mm
mpi_mm has started with 4 tasks.
Initializing arrays...
Matrix A ::
  1.00   2.00   3.00   4.00
  2.00   3.00   4.00   5.00
  3.00   4.00   5.00   6.00
  4.00   5.00   6.00   7.00
Matrix B ::
  1.00   2.00   3.00   4.00
  2.00   3.00   4.00   5.00
  3.00   4.00   5.00   6.00
  4.00   5.00   6.00   7.00
Sending 2 rows to task 1 offset=0
Sending 1 rows to task 2 offset=2
Sending 1 rows to task 3 offset=3
Received results from task 1
Received results from task 2
Received results from task 3
Result Matrix:
 30.00  40.00  50.00  60.00
 40.00  54.00  68.00  82.00
 50.00  68.00  86.00 104.00
 60.00  82.00 104.00 126.00
[cdekate@celeritas matrix_multiplication]$

Page 672: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

72

Page 673: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &

MEANS

APPLIED PARALLEL ALGORITHMS 2

Prof. Thomas Sterling
Dr. Hartmut Kaiser
Department of Computer Science
Louisiana State University
March 18, 2011

Page 674: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Puzzle of the Day

• Some nice ways to get something different from what

was intended:

2

if(a = 0) { … }       /* a always equals 0, but block will never be executed */

if(0 < a < 5) { … }   /* this "boolean" is always true! [think: (0 < a) < 5] */

if(a =! 0) { … }      /* a always equal to 1, as this is compiled as (a = !0), an assignment, rather than (a != 0) or (a == !0) */
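
For contrast, the comparisons that were presumably intended (a short sketch added here, not part of the slide):

if (a == 0)         { /* ... */ }   /* equality test, not assignment */
if (0 < a && a < 5) { /* ... */ }   /* range test spelled out explicitly */
if (a != 0)         { /* ... */ }   /* inequality test */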

Page 675: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition

• Matrix Transpose

• Gauss-Jordan Elimination

• LU Decomposition

• Summary Materials for Test

3

Page 676: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition

• Matrix Transpose

• Gauss-Jordan Elimination

• LU Decomposition

• Summary Materials for Test

4

Page 677: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

5

Parallel Matrix Processing & Locality

• Maximize locality

– Spatial locality
• Variable likely to be used if neighbor data is used
• Exploits unit or uniform stride access patterns
• Exploits cache line length
• Adjacent blocks minimize message traffic (depends on volume to surface ratio)

– Temporal locality
• Variable likely to be reused if already recently used
• Exploits cache loads and LRU (least recently used) replacement policy
• Exploits register allocation

– Granularity
• Maximizes length of local computation
• Reduces number of messages
• Maximizes length of individual messages

Page 678: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

6

Array Decomposition

• Simple MPI Example

• Master-Worker Data Partitioning and Distribution
– Array decomposition
– Uniformly distributes parts of the array among workers (and master)
– A kind of static load balancing: assumes equal work on equal data set sizes

• Demonstrates
– Data partitioning
– Data distribution
– Coarse grain parallel execution (no communication between tasks)
– Reduction operator
– Master-worker control model

Page 679: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

7

Array Decomposition Layout

• Dimensions – 1 dimension: linear (dot product)

– 2 dimensions: “2-D” or (matrix operations)

– 3 dimensions (higher order models)

– Impacts surface to volume ratio for inter process communications

• Distribution – Block

• Minimizes messaging

• Maximizes message size

– Cyclic

• Improves load balancing

• Memory layout– C vs. FORTRAN
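
A sketch of the two distribution schemes for a 1-D array of n elements over p processes (illustrative formulas added here, not from the slides): a block distribution gives each rank one contiguous chunk, while a cyclic distribution deals elements out round-robin.

/* Owner of global index i under each distribution (assumes n divisible by p). */
int block_owner(int i, int n, int p)  { return i / (n / p); }
int cyclic_owner(int i, int p)        { return i % p; }

/* Example with n = 8, p = 2:
   block : owners 0 0 0 0 1 1 1 1
   cyclic: owners 0 1 0 1 0 1 0 1 */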

Page 680: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

8

Array Decomposition

[Figure: the complete array is partitioned; a sum is accumulated from each part.]

Page 681: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

9

Array Decomposition

Demonstrate simple data decomposition:

– The master initializes the array and then distributes an equal portion of the array among the other tasks.
– When the other tasks receive their portion of the array, they perform an addition operation on each array element.
– Each task maintains the sum for its portion of the array.
– The master task does likewise with its portion of the array.
– As each of the non-master tasks finishes, it sends its updated portion of the array to the master.
– An MPI collective communication call is used to collect the sums maintained by each task.
– Finally, the master task displays selected parts of the final array and the global sum of all array elements.
– Assumption: the array can be equally divided among the group.

Page 682: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

10

Flowchart for Array Decomposition ("master" and "workers")

[Flowchart: the master and the workers each initialize the MPI environment; the master initializes the array, partitions it into workloads, and sends a workload to each worker; every task (master and workers) calculates the sum for its array chunk; the workers send their sums back, the master receives the results, a reduction operator sums up the results, and the master prints them before ending.]

Page 683: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

11

Array Decompositon (source code)#include "mpi.h"#include <stdio.h>#include <stdlib.h>#define ARRAYSIZE 16000000#define MASTER 0

float data[ARRAYSIZE];int main (int argc, char **argv){int numtasks, taskid, rc, dest, offset, i, j, tag1,

tag2, source, chunksize; float mysum, sum;float update(int myoffset, int chunk, int myid);

MPI_Status status;

MPI_Init(&argc, &argv);MPI_Comm_size(MPI_COMM_WORLD, &numtasks);if (numtasks % 4 != 0) {

printf("Quitting. Number of MPI tasks must be divisible by 4.\n"); /**For equal distribution of workload**/MPI_Abort(MPI_COMM_WORLD, rc);exit(0);}

MPI_Comm_rank(MPI_COMM_WORLD,&taskid);printf ("MPI task %d has started...\n", taskid);

chunksize = (ARRAYSIZE / numtasks);tag2 = 1;tag1 = 2;

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c

Workload to be processed by each processor

Page 684: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

12

Array Decomposition (source code)

if (taskid == MASTER){
  sum = 0;
  for(i=0; i<ARRAYSIZE; i++) {
    data[i] = i * 1.0;
    sum = sum + data[i];
  }
  printf("Initialized array sum = %e\n",sum);

  offset = chunksize;
  for (dest=1; dest<numtasks; dest++) {
    MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
    MPI_Send(&data[offset], chunksize, MPI_FLOAT, dest, tag2, MPI_COMM_WORLD);
    printf("Sent %d elements to task %d offset= %d\n",chunksize,dest,offset);
    offset = offset + chunksize;
  }

  offset = 0;
  mysum = update(offset, chunksize, taskid);

  for (i=1; i<numtasks; i++) {
    source = i;
    MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
    MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, MPI_COMM_WORLD, &status);
  }

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c

Initialize array

Array[0] -> Array[offset-1] is processed by the master

Send workloads to the respective processors

Master computes its local sum

Master receives the updated array portions computed by the workers

Page 685: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

13

Array Decomposition (source code)

  MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);

  printf("Sample results: \n");
  offset = 0;
  for (i=0; i<numtasks; i++) {
    for (j=0; j<5; j++)
      printf(" %e",data[offset+j]);
    printf("\n");
    offset = offset + chunksize;
  }
  printf("*** Final sum= %e ***\n",sum);
}   /* end of master section */

if (taskid > MASTER) {
  /* Receive my portion of array from the master task */
  source = MASTER;
  MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
  MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, MPI_COMM_WORLD, &status);
  mysum = update(offset, chunksize, taskid);
  /* Send my results back to the master task */
  dest = MASTER;
  MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
  MPI_Send(&data[offset], chunksize, MPI_FLOAT, MASTER, tag2, MPI_COMM_WORLD);
  MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
}   /* end of non-master */

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c

Master computes the SUM of all workloads

Worker processes receive work chunks from master

Each worker computes local sum

Send local sum to master process

Page 686: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

14

Array Decomposition (source code)

MPI_Finalize();

} /* end of main */

float update(int myoffset, int chunk, int myid) {
  int i;
  float mysum;
  /* Perform addition to each of my array elements and keep my sum */
  mysum = 0;
  for(i=myoffset; i < myoffset + chunk; i++) {
    data[i] = data[i] + i * 1.0;
    mysum = mysum + data[i];
  }
  printf("Task %d mysum = %e\n",myid,mysum);
  return(mysum);
}

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c

Page 687: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

15

Demo : Array Decomposition

[lsu00@master array_decomposition]$ mpiexec -np 4 ./array

MPI task 0 has started...

MPI task 2 has started...

MPI task 1 has started...

MPI task 3 has started...

Initialized array sum = 1.335708e+14

Sent 4000000 elements to task 1 offset= 4000000

Sent 4000000 elements to task 2 offset= 8000000

Task 1 mysum = 4.884048e+13

Sent 4000000 elements to task 3 offset= 12000000

Task 2 mysum = 7.983003e+13

Task 0 mysum = 1.598859e+13

Task 3 mysum = 1.161867e+14

Sample results:

0.000000e+00 2.000000e+00 4.000000e+00 6.000000e+00 8.000000e+00

8.000000e+06 8.000002e+06 8.000004e+06 8.000006e+06 8.000008e+06

1.600000e+07 1.600000e+07 1.600000e+07 1.600001e+07 1.600001e+07

2.400000e+07 2.400000e+07 2.400000e+07 2.400001e+07 2.400001e+07

*** Final sum= 2.608458e+14 ***

Output from arete for a 4 processor run.

Page 688: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition

• Matrix Transpose

• Gauss-Jordan Elimination

• LU Decomposition

• Summary Materials for Test

16

Page 689: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose

• The transpose of the (m x n) matrix A is the (n x m) matrix formed by interchanging the rows and columns, such that row i becomes column i of the transposed matrix

A = [ a11  a12  ...  a1n ]
    [ a21  a22  ...  a2n ]
    [ ...                ]
    [ am1  am2  ...  amn ]

A^T = [ a11  a21  ...  am1 ]
      [ a12  a22  ...  am2 ]
      [ ...                ]
      [ a1n  a2n  ...  amn ]

Examples:

A = [ 1  3  4 ]        A^T = [ 1  0 ]
    [ 0  1  0 ]              [ 3  1 ]
                             [ 4  0 ]

A = [ 1  3 ]           A^T = [ 1  2 ]
    [ 2  5 ]                 [ 3  5 ]

17

Page 690: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose - OpenMP

18

#include <stdio.h>
#include <sys/time.h>
#include <omp.h>
#define SIZE 4

main()
{
  int i, j;
  float Matrix[SIZE][SIZE], Trans[SIZE][SIZE];

  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      Matrix[i][j] = (i * j) * 5 + i;
  }
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      Trans[i][j] = 0.0;
  }

Initialize source matrix

Initialize results matrix

Page 691: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose - OpenMP

19

  #pragma omp parallel for private(j)
  for (i = 0; i < SIZE; i++)
    for (j = 0; j < SIZE; j++)
      Trans[j][i] = Matrix[i][j];

  printf("The Input Matrix Is \n");
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      printf("%f \t", Matrix[i][j]);
    printf("\n");
  }
  printf("\nThe Transpose Matrix Is \n");
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      printf("%f \t", Trans[i][j]);
    printf("\n");
  }
  return 0;
}

Perform transpose in parallel using omp parallel for

Page 692: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose – OpenMP (DEMO)

20

[LSU760000@n01 matrix_transpose]$ ./omp_mtrans

The Input Matrix Is
0.000000    0.000000    0.000000    0.000000
1.000000    6.000000    11.000000   16.000000
2.000000    12.000000   22.000000   32.000000
3.000000    18.000000   33.000000   48.000000

The Transpose Matrix Is
0.000000    1.000000    2.000000    3.000000
0.000000    6.000000    12.000000   18.000000
0.000000    11.000000   22.000000   33.000000
0.000000    16.000000   32.000000   48.000000

Page 693: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose - MPI

21

#include <stdio.h>#include "mpi.h"#define N 4int A[N][N];void fill_matrix(){int i,j;for(i = 0; i < N; i ++)

for(j = 0; j < N; j ++)A[i][j] = i * N + j;

}void print_matrix(){int i,j;for(i = 0; i < N; i ++) {

for(j = 0; j < N; j ++)printf("%d ", A[i][j]);

printf("\n");}

}

Initialize source matrix

Page 694: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose - MPI

22

main(int argc, char* argv[])
{
  int r, i;
  MPI_Status st;
  MPI_Datatype typ;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &r);

  if(r == 0) {
    fill_matrix();
    printf("\n Source:\n");
    print_matrix();
    MPI_Type_contiguous(N * N, MPI_INT, &typ);
    MPI_Type_commit(&typ);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Send(&(A[0][0]), 1, typ, 1, 0, MPI_COMM_WORLD);
  }

Creating custom MPI datatypeto store local workloads

Page 695: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose - MPI

23

    else if (r == 1) {
        /* MPI_Type_vector(count=N, blocklength=1, stride=N): N single ints,
           each separated by a stride of N ints, i.e. one column of the matrix.
           Combining it with MPI_Type_hvector allows an on-the-fly transpose
           of the matrix during the receive. */
        MPI_Type_vector(N, 1, N, MPI_INT, &typ);
        MPI_Type_hvector(N, 1, sizeof(int), typ, &typ);
        MPI_Type_commit(&typ);

        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Recv(&(A[0][0]), 1, typ, 0, 0, MPI_COMM_WORLD, &st);
        printf("\n Transposed:\n");
        print_matrix();
    }

    MPI_Finalize();
    return 0;
}

Page 696: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose – MPI (DEMO)

24

[LSU760000@n01 matrix_transpose]$ mpiexec -np 2 ./mpi_mtrans

Source:
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15

Transposed:
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15

Page 697: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition

• Matrix Transpose

• Gauss-Jordan Elimination

• LU Decomposition

• Summary Materials for Test

25

Page 698: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Linear Systems

  a11 x1 + a12 x2 + a13 x3 = b1
  a21 x1 + a22 x2 + a23 x3 = b2
  a31 x1 + a32 x2 + a33 x3 = b3

  [ a11  a12  a13 ] [ x1 ]   [ b1 ]
  [ a21  a22  a23 ] [ x2 ] = [ b2 ]
  [ a31  a32  a33 ] [ x3 ]   [ b3 ]

Solve Ax = b, where A is an n x n matrix and b is an n x 1 column vector

www.cs.princeton.edu/courses/archive/fall07/cos323/

26

Page 699: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Gauss-Jordan Elimination

• Fundamental operations:

1. Replace one equation with linear combination

of other equations

2. Interchange two equations

3. Re-label two variables

• Combine to reduce to trivial system

• The simplest variant only uses #1 operations, but one gets better stability by adding

  – #2, or

  – #2 and #3

www.cs.princeton.edu/courses/archive/fall07/cos323/

27

Page 700: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Gauss-Jordan Elimination

• Solve:

  2 x1 + 3 x2 = 7
  4 x1 + 5 x2 = 13

• Can be represented as the augmented matrix

  [ 2  3 |  7 ]
  [ 4  5 | 13 ]

• Goal: reduce the LHS to an identity matrix, leaving the solutions in the RHS:

  [ 1  0 | ? ]
  [ 0  1 | ? ]

www.cs.princeton.edu/courses/archive/fall07/cos323/

28

Page 701: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Gauss-Jordan Elimination

• Basic operation #1: replace any row by a linear combination of itself and any other row:

  replace row1 with 1/2 * row1 + 0 * row2

• Replace row2 with row2 – 4 * row1

• Negate row2

  [ 2  3 |  7 ]      [ 1  3/2 | 7/2 ]      [ 1  3/2 | 7/2 ]      [ 1  3/2 | 7/2 ]
  [ 4  5 | 13 ]  =>  [ 4   5  |  13 ]  =>  [ 0  -1  |  -1 ]  =>  [ 0   1  |   1 ]

www.cs.princeton.edu/courses/archive/fall07/cos323/

29

Row1 = (Row1)/2

Row2=Row2-(4*Row1)

Row2 = (-1)*Row2

Page 702: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Gauss-Jordan Elimination

• Replace row1 with row1 – 3/2 * row2

• Solution:

x1 = 2, x2 = 1

  [ 1  3/2 | 7/2 ]      [ 1  0 | 2 ]
  [ 0   1  |   1 ]  =>  [ 0  1 | 1 ]

www.cs.princeton.edu/courses/archive/fall07/cos323/

30

Row1 = Row1 – (3/2)* Row2
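The row operations above are easy to mechanize. As a rough illustration (not part of the original slides), here is a minimal serial C sketch of Gauss-Jordan elimination on an augmented n x (n+1) matrix; it assumes every pivot is nonzero (pivoting is the subject of the next slides), and the function and variable names are ours.

#include <stdio.h>

#define N 2

/* Gauss-Jordan elimination on an augmented N x (N+1) matrix.
   Assumes every pivot a[i][i] is nonzero (no pivoting). */
void gauss_jordan(double a[N][N + 1])
{
    for (int i = 0; i < N; i++) {
        /* Scale row i so that the pivot becomes 1 */
        double pivot = a[i][i];
        for (int k = i; k <= N; k++)
            a[i][k] /= pivot;
        /* Eliminate column i from every other row */
        for (int j = 0; j < N; j++) {
            if (j == i)
                continue;
            double factor = a[j][i];
            for (int k = i; k <= N; k++)
                a[j][k] -= factor * a[i][k];
        }
    }
}

int main(void)
{
    /* The example system above: 2x1 + 3x2 = 7, 4x1 + 5x2 = 13 */
    double a[N][N + 1] = { { 2, 3, 7 }, { 4, 5, 13 } };
    gauss_jordan(a);
    printf("x1 = %f, x2 = %f\n", a[0][N], a[1][N]);  /* expect 2 and 1 */
    return 0;
}

After the loop finishes, the last (augmented) column holds the solution, matching x1 = 2, x2 = 1 above.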

Page 703: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Pivoting

• Consider this system:

  [ 0  1 ] [ x1 ]   [ 2 ]
  [ 2  3 ] [ x2 ] = [ 8 ]

• Immediately run into a problem: the algorithm wants us to divide by zero!

• More subtle version (the pivot is tiny rather than exactly zero):

  [ 0.0001  1 ] [ x1 ]   [ 2 ]
  [   2     3 ] [ x2 ] = [ 8 ]

• The pivot, or pivot element, is the element of a matrix which is selected first by an algorithm to do computation

• The pivot entry is usually required to be at least distinct from zero, and often distant from it

• Select the largest element in the matrix and swap columns and rows to bring this element to the "right" position: full (complete) pivoting

www.cs.princeton.edu/courses/archive/fall07/cos323/

31

Page 704: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Pivoting

• Consider this system:

  [ 0  1 ] [ x1 ]   [ 1 ]
  [ 3  2 ] [ x2 ] = [ 8 ]

• Pivoting:
  – Swap rows 1 and 2:

  [ 3  2 ] [ x1 ]   [ 8 ]
  [ 0  1 ] [ x2 ] = [ 1 ]

  – And continue to solve as shown before:

  [ 1  2/3 | 8/3 ]      [ 1  0 | 2 ]
  [ 0   1  |  1  ]  =>  [ 0  1 | 1 ]

  Solution: x1 = 2, x2 = 1

www.cs.princeton.edu/courses/archive/fall07/cos323/

32

Page 705: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Pivoting:Example

• Division by small numbers leads to round-off error in computer arithmetic

• Consider the following system:
  0.0001 x1 + x2 = 1.000
  x1 + x2 = 2.000

• Exact solution: x1 = 1.0001 and x2 = 0.9999

• Say we round off after 3 digits after the decimal point

• Multiply the first equation by 10^4 and subtract it from the second equation:

  (1 - 1) x1 + (1 - 10^4) x2 = 2 - 10^4

• But, in finite precision with only 3 digits:

  – 1 - 10^4 = -0.9999E+4 ~ -0.999E+4
  – 2 - 10^4 = -0.9998E+4 ~ -0.999E+4

• Therefore, x2 = 1 and x1 = 0 (from the first equation)

• Very far from the real solution!

  In matrix form:

  [ 0.0001  1 ] [ x1 ]   [ 1 ]
  [   1     1 ] [ x2 ] = [ 2 ]

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

33

Page 706: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Partial Pivoting

• Partial pivoting doesn't look for the largest element in the matrix, but just for the largest element in the 'current' column

• Swap rows to bring the corresponding row to the 'right' position

• Partial pivoting is generally sufficient to adequately

reduce round-off error.

• Complete pivoting is usually not necessary to ensure

numerical stability

• Due to the additional computations it introduces, it may

not always be the most appropriate pivoting strategy

34

http://www.amath.washington.edu/~bloss/amath352_lectures/

Page 707: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Partial Pivoting

• One can just swap rows:

  x1 + x2 = 2.000
  0.0001 x1 + x2 = 1.000

• Multiply the first equation by 0.0001 and subtract it from the second equation, which gives:

  (1 - 0.0001) x2 = 1 - 0.0002
  0.9999 x2 = 0.9998 => x2 = 1 (after rounding to 3 digits), and then x1 = 1

• The final solution is closer to the real solution.

• Partial Pivoting

  – For numerical stability, one doesn't go in order, but picks the next row among rows i to n that has the largest element in column i (see the code sketch below)

  – This row is swapped with row i (along with the elements of the right-hand side) before the subtractions

    • the swap is not done in memory but rather one keeps an indirection array

• Total Pivoting

– Look for the greatest element ANYWHERE in the matrix

– Swap columns

– Swap rows

• Numerical stability is really a difficult field

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

35
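As a rough illustration (not part of the original slides), here is a minimal C sketch of the partial-pivoting row selection inside Gaussian elimination on an augmented n x (n+1) matrix. For simplicity the swap is done directly in memory rather than through an indirection array, and all names are ours.

#include <math.h>
#include <stdio.h>

#define N 2

/* Gaussian elimination with partial pivoting on an augmented N x (N+1) matrix. */
void eliminate_with_partial_pivoting(double a[N][N + 1])
{
    for (int i = 0; i < N; i++) {
        /* Partial pivoting: among rows i..N-1, find the row with the
           largest absolute value in column i ... */
        int pivot_row = i;
        for (int r = i + 1; r < N; r++)
            if (fabs(a[r][i]) > fabs(a[pivot_row][i]))
                pivot_row = r;
        /* ... and swap it (including the right-hand side) with row i. */
        for (int k = 0; k <= N; k++) {
            double tmp = a[i][k];
            a[i][k] = a[pivot_row][k];
            a[pivot_row][k] = tmp;
        }
        /* Eliminate column i below the diagonal. */
        for (int j = i + 1; j < N; j++) {
            double factor = a[j][i] / a[i][i];
            for (int k = i; k <= N; k++)
                a[j][k] -= factor * a[i][k];
        }
    }
}

int main(void)
{
    /* The example above: 0.0001 x1 + x2 = 1.000, x1 + x2 = 2.000 */
    double a[N][N + 1] = { { 0.0001, 1.0, 1.0 }, { 1.0, 1.0, 2.0 } };
    eliminate_with_partial_pivoting(a);
    for (int i = 0; i < N; i++)
        printf("%10.6f %10.6f | %10.6f\n", a[i][0], a[i][1], a[i][2]);
    return 0;
}

The largest entry of column 1 is the 1 in row 2, so the rows are swapped before elimination and the tiny 0.0001 pivot is never divided by.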

Page 708: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Partial Pivoting

36

http://www.amath.washington.edu/~bloss/amath352_lectures/

Page 709: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Special Cases

• Common special cases:

• Tri-diagonal systems:
  – Only the main diagonal & 1 band above, 1 below
  – Solve using: Gauss-Jordan

• Lower Triangular Systems (L)
  – Solve using: forward substitution

• Upper Triangular Systems (U)
  – Solve using: backward substitution

  Tri-diagonal system:

  [ a11  a12   0    0  ] [ x1 ]   [ b1 ]
  [ a21  a22  a23   0  ] [ x2 ] = [ b2 ]
  [  0   a32  a33  a34 ] [ x3 ]   [ b3 ]
  [  0    0   a43  a44 ] [ x4 ]   [ b4 ]

  Lower triangular system, solved by forward substitution:

  [ a11   0    0    0  ] [ x1 ]   [ b1 ]
  [ a21  a22   0    0  ] [ x2 ] = [ b2 ]
  [ a31  a32  a33   0  ] [ x3 ]   [ b3 ]
  [ a41  a42  a43  a44 ] [ x4 ]   [ b4 ]

  x1 = b1 / a11
  x2 = (b2 - a21 x1) / a22
  x3 = (b3 - a31 x1 - a32 x2) / a33
  ...

  Upper triangular system, solved by backward substitution:

  [ a11  a12  a13  a14  a15 ] [ x1 ]   [ b1 ]
  [  0   a22  a23  a24  a25 ] [ x2 ]   [ b2 ]
  [  0    0   a33  a34  a35 ] [ x3 ] = [ b3 ]
  [  0    0    0   a44  a45 ] [ x4 ]   [ b4 ]
  [  0    0    0    0   a55 ] [ x5 ]   [ b5 ]

  x5 = b5 / a55
  x4 = (b4 - a45 x5) / a44
  ...

www.cs.princeton.edu/courses/archive/fall07/cos323/

37

Page 710: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition

• Matrix Transpose

• Gauss-Jordan Elimination

• LU Decomposition

• Summary Materials for Test

38

Page 711: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Solving Linear Systems of Eq.

• Method for solving Linear Systems

– The need to solve linear systems arises in an estimated 75% of all scientific computing problems [Dahlquist 1974]

• Gaussian Elimination is perhaps the most well-known method

– based on the fact that the solution of a linear system is invariant under scaling and under row additions

• One can multiply a row of the matrix by a constant as long as one multiplies the corresponding element of the right-hand side by the same constant

• One can add a row of the matrix to another one as long as one adds the corresponding elements of the right-hand side

– Idea: scale and add equations so as to transform matrix A into an upper triangular matrix:

  (Figure: A x = b with A reduced to upper triangular form; equation n-i then has i unknowns)

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

39

Page 712: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Gaussian Elimination

  [ 1  1  1 ]       [ 0 ]
  [ 1 -2  2 ]  x =  [ 4 ]
  [ 1  2 -1 ]       [ 2 ]

Subtract row 1 from rows 2 and 3:

  [ 1  1  1 ]       [ 0 ]
  [ 0 -3  1 ]  x =  [ 4 ]
  [ 0  1 -2 ]       [ 2 ]

Multiply row 3 by 3 and add row 2:

  [ 1  1  1 ]       [ 0 ]
  [ 0 -3  1 ]  x =  [ 4 ]
  [ 0  0 -5 ]       [ 10 ]

Solving the equations in reverse order (backsolving):

  -5 x3 = 10           =>  x3 = -2
  -3 x2 + x3 = 4       =>  x2 = -2
  x1 + x2 + x3 = 0     =>  x1 = 4

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

40

Page 713: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Gaussian Elimination

• The algorithm goes through the matrix from the top-left

corner to the bottom-right corner

• The ith step eliminates non-zero sub-diagonal elements

in column i, subtracting the ith row scaled by aji/aii from

row j, for j=i+1,..,n.

  (Figure: at step i, the rows above pivot row i hold values already computed;
   the entries of column i below the pivot row are to be zeroed;
   the remaining lower-right block holds values yet to be updated.)

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

41

Page 714: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Sequential Gaussian Elimination

Simple sequential algorithm

// for each column i
// zero it out below the diagonal by adding
// multiples of row i to later rows
for i = 1 to n-1
    // for each row j below row i
    for j = i+1 to n
        // add a multiple of row i to row j
        for k = i to n
            A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

• Several “tricks” that do not change the spirit of the algorithm but

make implementation easier and/or more efficient

– Right-hand side is typically kept in column n+1 of the matrix and one speaks of an augmented matrix

– Compute the A(j,i)/A(i,i) term outside of the innermost loop

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

42

Page 715: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Parallel Gaussian Elimination?

• Assume that we have one processor per matrix element

  – Reduction: to find the max aji

  – Broadcast: the max aji is needed to compute the scaling factor

  – Compute: independent computation of the scaling factor

  – Broadcasts: every update needs the scaling factor and the element from the pivot row

  – Compute: independent computations

  (A simple shared-memory sketch of the independent update step follows this slide.)

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

43
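As a rough shared-memory approximation of the scheme above (not part of the original slides, and without the pivoting reduction/broadcast steps), note that the row updates at each step are independent and can simply be parallelized with OpenMP; all names here are ours:

#include <stdio.h>
#include <omp.h>

#define N 3

/* Gaussian elimination with the independent row updates done in parallel.
   No pivoting: assumes the pivot a[i][i] is never zero. */
void parallel_ge(double a[N][N + 1])
{
    for (int i = 0; i < N - 1; i++) {
        #pragma omp parallel for
        for (int j = i + 1; j < N; j++) {      /* each row update is independent */
            double factor = a[j][i] / a[i][i];
            for (int k = i; k <= N; k++)
                a[j][k] -= factor * a[i][k];
        }
    }
}

int main(void)
{
    /* The example from the Gaussian Elimination slide: solution x = (4, -2, -2) */
    double a[N][N + 1] = { { 1, 1, 1, 0 }, { 1, -2, 2, 4 }, { 1, 2, -1, 2 } };
    double x[N];

    parallel_ge(a);

    for (int i = N - 1; i >= 0; i--) {         /* backsolving */
        x[i] = a[i][N];
        for (int k = i + 1; k < N; k++)
            x[i] -= a[i][k] * x[k];
        x[i] /= a[i][i];
    }
    printf("x = %f %f %f\n", x[0], x[1], x[2]);
    return 0;
}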

Page 716: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

LU Factorization

• Gaussian Elimination is simple but

– What if we have to solve many Ax = b systems for different values of b?
  • This happens a LOT in real applications

• Another method is the “LU Factorization” (LU Decomposition)

• Ax = b

• Say we could rewrite A = L U, where L is a lower triangular matrix and U is an upper triangular matrix: O(n^3)

• Then Ax = b is written L U x = b

• Solve L y = b: O(n^2)

• Solve U x = y: O(n^2)

  (L y = b: equation i has i unknowns; U x = y: equation n-i has i unknowns; triangular system solves are easy)

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

44
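As a rough illustration (not part of the original slides), here is a minimal C sketch of the two O(n^2) triangular solves used once L and U are known; all names are ours, and the L and U values in main() are taken from the worked example on the next slide:

#include <stdio.h>

#define N 3

/* Solve A x = b given A = L U:
   first L y = b by forward substitution, then U x = y by backward substitution. */
void lu_solve(const double L[N][N], const double U[N][N], const double b[N], double x[N])
{
    double y[N];
    for (int i = 0; i < N; i++) {          /* forward substitution: L y = b */
        y[i] = b[i];
        for (int j = 0; j < i; j++)
            y[i] -= L[i][j] * y[j];
        y[i] /= L[i][i];
    }
    for (int i = N - 1; i >= 0; i--) {     /* backward substitution: U x = y */
        x[i] = y[i];
        for (int j = i + 1; j < N; j++)
            x[i] -= U[i][j] * x[j];
        x[i] /= U[i][i];
    }
}

int main(void)
{
    /* L and U of A = [1 2 -1; 4 3 1; 2 2 3] (see the next slide) */
    double L[N][N] = { { 1, 0, 0 }, { 4, 1, 0 }, { 2, 0.4, 1 } };
    double U[N][N] = { { 1, 2, -1 }, { 0, -5, 5 }, { 0, 0, 3 } };
    double b[N] = { 2, 8, 7 };             /* = A * (1, 1, 1) */
    double x[N];

    lu_solve(L, U, b, x);
    printf("x = %f %f %f\n", x[0], x[1], x[2]);   /* expect 1 1 1 */
    return 0;
}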

Page 717: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

LU Factorization: Principle

• It works just like the Gaussian Elimination, but instead of zeroing out elements, one “saves” scaling coefficients.

• Magically, A = L x U !

• Should be done with pivoting as well

  Start:
  [ 1  2 -1 ]
  [ 4  3  1 ]
  [ 2  2  3 ]

  Gaussian elimination (Row2 = Row2 - 4*Row1):
  [ 1  2 -1 ]
  [ 0 -5  5 ]
  [ 2  2  3 ]

  Save the scaling factor (4) in the zeroed position:
  [ 1  2 -1 ]
  [ 4 -5  5 ]
  [ 2  2  3 ]

  Gaussian elimination + save the scaling factor (Row3 = Row3 - 2*Row1):
  [ 1  2 -1 ]
  [ 4 -5  5 ]
  [ 2 -2  5 ]

  Gaussian elimination + save the scaling factor (Row3 = Row3 - (2/5)*Row2):
  [ 1   2  -1 ]
  [ 4  -5   5 ]
  [ 2  2/5  3 ]

  L = [ 1   0   0 ]        U = [ 1  2 -1 ]
      [ 4   1   0 ]            [ 0 -5  5 ]
      [ 2  2/5  1 ]            [ 0  0  3 ]

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

45
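As a rough illustration (not part of the original slides), a minimal C sketch of this in-place factorization; the names are ours, and it uses the "positive factor" convention of the worked example above (the LU-sequential pseudo-code on the next slide stores the negated factor instead, but the idea is the same):

#include <stdio.h>

#define N 3

/* In-place LU factorization without pivoting:
   afterwards the strictly lower part of a[][] holds the saved scaling factors
   (L, with an implicit unit diagonal) and the rest holds U. */
void lu_in_place(double a[N][N])
{
    for (int k = 0; k < N - 1; k++) {
        for (int i = k + 1; i < N; i++) {
            a[i][k] /= a[k][k];                  /* save the scaling factor */
            for (int j = k + 1; j < N; j++)
                a[i][j] -= a[i][k] * a[k][j];    /* Gaussian elimination step */
        }
    }
}

int main(void)
{
    /* The worked example above: expect factors 4, 2, 2/5 below the diagonal
       and U = [1 2 -1; 0 -5 5; 0 0 3] on and above it. */
    double a[N][N] = { { 1, 2, -1 }, { 4, 3, 1 }, { 2, 2, 3 } };
    lu_in_place(a);
    for (int i = 0; i < N; i++)
        printf("%8.3f %8.3f %8.3f\n", a[i][0], a[i][1], a[i][2]);
    return 0;
}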

Page 718: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

LU Factorization

LU-sequential(A, n) {
    for k = 0 to n-2 {
        // preparing column k
        for i = k+1 to n-1
            aik ← -aik / akk        // stores the scaling factors
        for j = k+1 to n-1
            // Task Tkj: update of column j
            for i = k+1 to n-1
                aij ← aij + aik * akj
    }
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

• We’re going to look at the simplest possible version

– No pivoting: just creates a bunch of indirections that are easy but make

the code look complicated without changing the overall principle

46

Page 719: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

LU Factorization

• We’re going to look at the simplest possible version

– No pivoting: just creates a bunch of indirections that are easy but make

the code look complicated without changing the overall principle

LU-sequential(A, n) {
    for k = 0 to n-2 {
        // preparing column k
        for i = k+1 to n-1
            aik ← -aik / akk
        for j = k+1 to n-1
            // Task Tkj: update of column j
            for i = k+1 to n-1
                aij ← aij + aik * akj
    }
}

  (Figure: at step k, the prepared column k is combined with row k to update every entry (i, j) with i, j > k.)

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

47

Page 720: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Parallel LU on a ring

• Since the algorithm operates by columns from left to right, we should

distribute columns to processors

• Principle of the algorithm

– At each step, the processor that owns column k does the “prepare” task

and then broadcasts the bottom part of column k to all others

• Annoying if the matrix is stored in row-major fashion

• Remember that one is free to store the matrix in any way one wants, as long

  as it is coherent and the right output is generated

– After the broadcast, the other processors can then update their data.

• Assume there is a function alloc(k) that returns the rank of the

processor that owns column k

– Basically so that we don’t clutter our program with too many global-to-

local index translations

• In fact, we will first write everything in terms of global indices, so as to

  avoid all the annoying index arithmetic

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

48

Page 721: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

LU-broadcast algorithm

LU-broadcast(A, n) {
    q ← MY_NUM()
    p ← NUM_PROCS()
    for k = 0 to n-2 {
        if (alloc(k) == q)
            // preparing column k
            for i = k+1 to n-1
                buffer[i-k-1] ← aik ← -aik / akk
        broadcast(alloc(k), buffer, n-k-1)
        for j = k+1 to n-1
            if (alloc(j) == q)
                // update of column j
                for i = k+1 to n-1
                    aij ← aij + buffer[i-k-1] * akj
    }
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

49
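As a rough illustration (not part of the original slides), the pseudo-code above maps almost directly onto MPI when every rank keeps the full array, uses global indices, and only updates the columns it owns; alloc_col() (a cyclic mapping here), the test matrix, and all other names are ours, and the broadcast becomes an MPI_Bcast:

#include <stdio.h>
#include <mpi.h>

#define N 8

/* Column k is owned by process alloc_col(k, p); here a simple cyclic mapping. */
static int alloc_col(int k, int p) { return k % p; }

int main(int argc, char *argv[])
{
    int q, p;
    double a[N][N], buffer[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &q);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Every rank builds the same (diagonally dominant) test matrix, but from
       here on it only updates the columns it owns. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (i == j) ? N : 1.0;

    for (int k = 0; k < N - 1; k++) {
        if (alloc_col(k, p) == q)                 /* preparing column k */
            for (int i = k + 1; i < N; i++)
                buffer[i - k - 1] = a[i][k] = -a[i][k] / a[k][k];

        MPI_Bcast(buffer, N - k - 1, MPI_DOUBLE, alloc_col(k, p), MPI_COMM_WORLD);

        for (int j = k + 1; j < N; j++)           /* update of owned columns */
            if (alloc_col(j, p) == q)
                for (int i = k + 1; i < N; i++)
                    a[i][j] += buffer[i - k - 1] * a[k][j];
    }

    if (q == 0)
        printf("done: each rank holds the factored entries of its own columns\n");

    MPI_Finalize();
    return 0;
}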

Page 722: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Dealing with local indices

• Assume that p divides n

• Each processor needs to store r=n/p columns and its

local indices go from 0 to r-1

• After step k, only columns with indices greater than k will

be used

• Simple idea: use a local index, l, that everyone initializes

to 0

• At step k, processor alloc(k) increases its local index so

that next time it will point to its next local column

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

50
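For concreteness (not part of the original slides), the closed-form index translations corresponding to this bookkeeping are easy to state; a tiny C sketch, assuming p divides n as above:

#include <stdio.h>

int main(void)
{
    /* Global-to-local column index translation, with r = n/p columns per process.
       Block distribution:  global column k lives on process k / r, at local index k % r.
       Cyclic distribution: global column k lives on process k % p, at local index k / p. */
    int n = 8, p = 4, r = n / p;

    for (int k = 0; k < n; k++)
        printf("col %d: block -> (proc %d, local %d), cyclic -> (proc %d, local %d)\n",
               k, k / r, k % r, k % p, k / p);
    return 0;
}

The running local index l in the pseudo-code achieves the same effect incrementally, without recomputing these formulas at every step.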

Page 723: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

LU-broadcast algorithm

...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0

for k = 0 to n-2 {
    if (alloc(k) == q)
        for i = k+1 to n-1
            buffer[i-k-1] ← a[i,k] ← -a[i,l] / a[k,l]
        l ← l+1
    broadcast(alloc(k), buffer, n-k-1)
    for j = l to r-1
        for i = k+1 to n-1
            a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

51

Page 724: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Bad load balancing

  (Figure: columns distributed in contiguous blocks to P1, P2, P3, P4; the columns
   of P1 and P2 are already done, P3 is working on it, and the rest must wait.)

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

52

Page 725: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Good Load Balancing?

  (Figure: cyclic distribution of the columns; the "already done" and "working on it"
   columns are interleaved, so every processor stays busy.)

  Cyclic distribution

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

53

Page 726: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Load-balanced program

...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0

for k = 0 to n-2 {
    if (k mod p == q)
        for i = k+1 to n-1
            buffer[i-k-1] ← a[i,k] ← -a[i,l] / a[k,l]
        l ← l+1
    broadcast(alloc(k), buffer, n-k-1)
    for j = l to r-1
        for i = k+1 to n-1
            a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

54

Page 727: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Performance Analysis

• How long does this code take to run?
  – This is not an easy question because there are many tasks and many communications

• A little bit of analysis shows that the execution time is the sum of three terms:
  – n-1 communications: n L + (n^2/2) b + O(1)
  – n-1 column preparations: (n^2/2) w' + O(1)
  – column updates: (n^3/3p) w + O(n^2)

• Therefore, the execution time is O(n^3/p)
  – Note that the sequential time is O(n^3)

• Therefore, we have perfect asymptotic efficiency!
  – This is good, but isn't always the best in practice

• How can we improve this algorithm?

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

55
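Consolidating the three terms listed above (writing the parallel time as T_p(n), our notation):

\[
T_p(n) \;\approx\; \underbrace{nL + \tfrac{n^2}{2}\,b}_{\text{communications}}
\;+\; \underbrace{\tfrac{n^2}{2}\,w'}_{\text{column preparations}}
\;+\; \underbrace{\tfrac{n^3}{3p}\,w}_{\text{column updates}}
\;=\; O\!\left(\tfrac{n^3}{p}\right),
\qquad T_{\text{seq}}(n) = O(n^3).
\]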

Page 728: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Pipelining on the Ring

• So far in the algorithm we've used a simple broadcast

• Nothing was specific to being on a ring of processors, and it's portable
  – in fact, you could just write raw MPI that looks like our pseudo-code and have a very limited (inefficient for small n) LU factorization that works only for some numbers of processors

• But it's not efficient
  – The n-1 communication steps are not overlapped with computations
  – Therefore Amdahl's law, etc.

• It turns out that on a ring, with a cyclic distribution of the columns, one can interleave pieces of the broadcast with the computation
  – It almost looks like inserting the source code from the broadcast code we saw at the very beginning throughout the LU code

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

56

Page 729: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Previous program

...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0

for k = 0 to n-2 {
    if (k mod p == q)
        for i = k+1 to n-1
            buffer[i-k-1] ← a[i,k] ← -a[i,l] / a[k,l]
        l ← l+1
    broadcast(alloc(k), buffer, n-k-1)
    for j = l to r-1
        for i = k+1 to n-1
            a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

57

Page 730: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

LU-pipeline algorithm

double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0

for k = 0 to n-2 {
    if (k mod p == q)
        for i = k+1 to n-1
            buffer[i-k-1] ← a[i,k] ← -a[i,l] / a[k,l]
        l ← l+1
        send(buffer, n-k-1)
    else
        recv(buffer, n-k-1)
        if (q ≠ (k-1) mod p) send(buffer, n-k-1)
    for j = l to r-1
        for i = k+1 to n-1
            a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

58

Page 731: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition

• Matrix Transpose

• Gauss-Jordan Elimination

• LU Decomposition

• Summary Materials for Test

59

Page 732: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Summary : Material for the Test

• Matrix Transpose: Slides 17-23

• Gauss-Jordan: Slides 26-30

• Pivoting: Slides 31-36

• Special Cases (forward & backward substitution): Slide 37

• LU Decomposition: Slides 44-58

60

Page 733: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

61