
Parallel Computing

Instructor: Nurbek Saparkhojayev
Lecture #1: Introduction to Parallel Computing

Page 2: Paralel Computing

Lecture#1 outline

Background

Why use parallel computing?

Who and What?

Concepts and Terminology

Parallel Computer Memory Architectures

Page 3: Paralel Computing

Background

Traditionally, software has been written for serial computation:

* To be run on a single computer having a single Central Processing Unit (CPU);
* A problem is broken into a discrete series of instructions.
* Instructions are executed one after another.
* Only one instruction may execute at any moment in time.

Page 4: Paralel Computing

Cont.

In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:

* To be run using multiple CPUs
* A problem is broken into discrete parts that can be solved concurrently
* Each part is further broken down to a series of instructions
* Instructions from each part execute simultaneously on different CPUs

Page 5: Paralel Computing

Parallel Computing

The compute resources can include:

* A single computer with multiple processors;
* An arbitrary number of computers connected by a network;
* A combination of both.

The computational problem usually demonstrates the ability to:

* Be broken apart into discrete pieces of work that can be solved simultaneously;
* Execute multiple program instructions at any moment in time;
* Be solved in less time with multiple compute resources than with a single compute resource.

Page 6: Paralel Computing

The Universe is Parallel:

Parallel computing is an evolution of serial computing that attempts to emulate what has always been the state of affairs in the natural world: many complex, interrelated events happening at the same time, yet within a sequence. For example:

* Galaxy formation
* Planetary movement
* Weather and ocean patterns
* Tectonic plate drift
* Rush hour traffic
* Automobile assembly lines
* Building a space shuttle
* Ordering a hamburger at the drive-through.

Page 7: Paralel Computing

The Real World is Massively Parallel

Page 8: Paralel Computing

Uses for Parallel Computing:

Historically, parallel computing has been considered "the high end of computing", and has been used to model difficult scientific and engineering problems found in the real world. Some examples:

* Atmosphere, Earth, Environment
* Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
* Bioscience, Biotechnology, Genetics
* Chemistry, Molecular Sciences
* Geology, Seismology
* Mechanical Engineering - from prosthetics to spacecraft
* Electrical Engineering, Circuit Design, Microelectronics
* Computer Science, Mathematics


Page 10: Paralel Computing

Different applications

Today, commercial applications provide an equal or greater driving force in the development of faster computers. These applications require the processing of large amounts of data in sophisticated ways. For example:

* Databases, data mining
* Oil exploration
* Web search engines, web-based business services
* Medical imaging and diagnosis
* Pharmaceutical design
* Management of national and multi-national corporations
* Financial and economic modeling
* Advanced graphics and virtual reality, particularly in the entertainment industry
* Networked video and multi-media technologies
* Collaborative work environments


Page 12: Paralel Computing

Why use Parallel Computing?

Main Reasons:

a. Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings. Parallel clusters can be built from cheap, commodity components.

Page 13: Paralel Computing

cont.

b. Solve larger problems: Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer, especially given limited computer memory. For example:

* "Grand Challenge" problems (en.wikipedia.org/wiki/Grand_Challenge) requiring PetaFLOPS and PetaBytes of computing resources.
* Web search engines/databases processing millions of transactions per second

Page 14: Paralel Computing

cont.

c. Provide concurrency: A single compute resource can only do one thing at a time. Multiple computing resources can be doing many things simultaneously. For example, the Access Grid (www.accessgrid.org) provides a global collaboration network where people from around the world can meet and conduct work "virtually".

Page 15: Paralel Computing

cont.

d. Use of non-local resources: Using compute resources on a wide area network, or even the Internet, when local compute resources are scarce. For example:

* SETI@home (setiathome.berkeley.edu) uses over 330,000 computers for a compute power of over 528 TeraFLOPS (as of August 04, 2008)
* Folding@home (folding.stanford.edu) uses over 340,000 computers for a compute power of 4.2 PetaFLOPS (as of November 4, 2008)

Page 16: Paralel Computing

cont.

e. Limits to serial computing: Both physical and practical reasons pose significant constraints to simply building ever faster serial computers:

* Transmission speeds - the speed of a serial computer is directly dependent upon how fast data can move through hardware. Absolute limits are the speed of light (30 cm/nanosecond) and the transmission limit of copper wire (9 cm/nanosecond). Increasing speeds necessitate increasing proximity of processing elements.
* Limits to miniaturization - processor technology is allowing an increasing number of transistors to be placed on a chip. However, even with molecular or atomic-level components, a limit will be reached on how small components can be.
* Economic limitations - it is increasingly expensive to make a single processor faster. Using a larger number of moderately fast commodity processors to achieve the same (or better) performance is less expensive.

The bottom line: current computer architectures are increasingly relying upon hardware-level parallelism to improve performance:

* Multiple execution units
* Pipelined instructions
* Multi-core

Page 17: Paralel Computing

Who and What?

Top500.org provides statistics on parallel computing users - the charts below are just a sample. Some things to note:

* Sectors may overlap - for example, research may be classified research. Respondents have to choose between the two.

* "Not Specified" is by far the largest application - probably means multiple applications.

Page 18: Paralel Computing

Who's doing Parallel Computing?

Page 19: Paralel Computing

Future

The Future: During the past 20 years, the trends indicated by ever faster networks, distributed systems, and multi-processor computer architectures (even at the desktop level) clearly show that parallelism is the future of computing.

Page 20: Paralel Computing

Concepts and Terminology
von Neumann Architecture

* Named after the Hungarian mathematician John von Neumann, who first authored the general requirements for an electronic computer in his 1945 papers.
* Since then, virtually all computers have followed this basic design, which differed from earlier computers programmed through "hard wiring".

Four main components:

1. Memory
2. Control Unit
3. Arithmetic Logic Unit
4. Input/Output

* Read/write, random access memory is used to store both program instructions and data.
  o Program instructions are coded data which tell the computer to do something.
  o Data is simply information to be used by the program.
* The Control Unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task.
* The Arithmetic Logic Unit performs basic arithmetic operations.
* Input/Output is the interface to the human operator.

Page 21: Paralel Computing

Von Neumann architecture

Page 22: Paralel Computing

Flynn's Classical Taxonomy

There are different ways to classify parallel computers. One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy.

Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction and Data. Each of these dimensions can have only one of two possible states: Single or Multiple.

There are 4 possible classifications according to Flynn:

* SISD - Single Instruction, Single Data
* SIMD - Single Instruction, Multiple Data
* MISD - Multiple Instruction, Single Data
* MIMD - Multiple Instruction, Multiple Data

Page 23: Paralel Computing

Flynn's Classical Taxonomy-SISD

Single Instruction, Single Data (SISD):

* A serial (non-parallel) computer
* Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
* Single data: only one data stream is being used as input during any one clock cycle
* Deterministic execution
* This is the oldest and, even today, the most common type of computer
* Examples: older generation mainframes, minicomputers and workstations; most modern day PCs.

Page 24: Paralel Computing

SISD

Page 25: Paralel Computing

SIMD

Single Instruction, Multiple Data (SIMD):

* A type of parallel computer
* Single instruction: All processing units execute the same instruction at any given clock cycle
* Multiple data: Each processing unit can operate on a different data element
* Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing.
* Synchronous (lockstep) and deterministic execution
* Two varieties: Processor Arrays and Vector Pipelines
* Examples:
  o Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
  o Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10
* Most modern computers, particularly those with graphics processing units (GPUs), employ SIMD instructions and execution units.

Page 26: Paralel Computing

SIMD

Page 27: Paralel Computing

SIMD

Page 28: Paralel Computing

Multiple Instruction, Single Data (MISD):

* A single data stream is fed into multiple processing units.
* Each processing unit operates on the data independently via independent instruction streams.
* Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).
* Some conceivable uses might be:
  o multiple frequency filters operating on a single signal stream
  o multiple cryptography algorithms attempting to crack a single coded message.

Page 29: Paralel Computing

MISD

Page 30: Paralel Computing

Multiple Instruction, Multiple Data (MIMD)

* Currently, the most common type of parallel computer. Most modern computers fall into this category.
* Multiple Instruction: every processor may be executing a different instruction stream
* Multiple Data: every processor may be working with a different data stream
* Execution can be synchronous or asynchronous, deterministic or non-deterministic
* Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs.
* Note: many MIMD architectures also include SIMD execution sub-components

Page 31: Paralel Computing

MIMD

Page 32: Paralel Computing

Some General Parallel Terminology

Task - A logically discrete section of computational work. A task is typically a program or program-like set of instructions that is executed by a processor.

Parallel Task - A task that can be executed by multiple processors safely (yields correct results).

Serial Execution - Execution of a program sequentially, one statement at a time. In the simplest sense, this is what happens on a one-processor machine. However, virtually all parallel programs have sections that must be executed serially.

Parallel Execution - Execution of a program by more than one task, with each task being able to execute the same or a different statement at the same moment in time.

Pipelining - Breaking a task into steps performed by different processor units, with inputs streaming through, much like an assembly line; a type of parallel computing.

Shared Memory - From a strictly hardware point of view, describes a computer architecture where all processors have direct (usually bus-based) access to common physical memory. In a programming sense, it describes a model where parallel tasks all have the same "picture" of memory and can directly address and access the same logical memory locations regardless of where the physical memory actually exists.

Page 33: Paralel Computing

Terminology

Symmetric Multi-Processor (SMP) - Hardware architecture where multiple processors share a single address space and access to all resources; shared memory computing.

Distributed Memory - In hardware, refers to network-based memory access for physical memory that is not common. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing.

Communications - Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network; however, the actual event of data exchange is commonly referred to as communications regardless of the method employed.

Synchronization - The coordination of parallel tasks in real time, very often associated with communications. Often implemented by establishing a synchronization point within an application where a task may not proceed further until another task(s) reaches the same or a logically equivalent point. Synchronization usually involves waiting by at least one task, and can therefore cause a parallel application's wall clock execution time to increase.

Page 34: Paralel Computing

Terminology

Granularity - In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.

* Coarse: relatively large amounts of computational work are done between communication events
* Fine: relatively small amounts of computational work are done between communication events

Observed Speedup - Observed speedup of a code which has been parallelized, defined as:

    speedup = wall-clock time of serial execution / wall-clock time of parallel execution

One of the simplest and most widely used indicators of a parallel program's performance.
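As a small illustration (not part of the original slides; the helper function and the timing numbers below are made up), speedup and efficiency can be computed directly from measured wall-clock times:

#include <stdio.h>

/* Observed speedup and parallel efficiency from measured wall-clock times.
 * t_serial  : wall-clock time of the serial run (seconds)
 * t_parallel: wall-clock time of the parallel run (seconds)
 * nprocs    : number of processors used in the parallel run               */
static void report_speedup(double t_serial, double t_parallel, int nprocs)
{
    double speedup    = t_serial / t_parallel;
    double efficiency = speedup / nprocs;      /* 1.0 would be ideal scaling */
    printf("speedup = %.2f, efficiency = %.2f\n", speedup, efficiency);
}

int main(void)
{
    /* Example numbers only: 120 s serial vs. 20 s on 8 processors. */
    report_speedup(120.0, 20.0, 8);
    return 0;
}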

Page 35: Paralel Computing

Terminology

Parallel Overhead - The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel overhead can include factors such as:

* Task start-up time
* Synchronizations
* Data communications
* Software overhead imposed by parallel compilers, libraries, tools, operating system, etc.
* Task termination time

Massively Parallel - Refers to the hardware that comprises a given parallel system - having many processors. The meaning of "many" keeps increasing, but currently the largest parallel computers can be comprised of processors numbering in the hundreds of thousands.

Embarrassingly Parallel - Solving many similar, but independent, tasks simultaneously; little to no need for coordination between the tasks.

Scalability - Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more processors. Factors that contribute to scalability include:

* Hardware - particularly memory-CPU bandwidths and network communications
* Application algorithm
* Parallel overhead
* Characteristics of your specific application and coding

Multi-core Processors - Multiple processors (cores) on a single chip.

Cluster Computing - Use of a combination of commodity units (processors, networks or SMPs) to build a parallel system.

Supercomputing / High Performance Computing - Use of the world's fastest, largest machines to solve large problems.

Page 36: Paralel Computing

Parallel Computer Memory Architectures

a. Shared Memory - General Characteristics:

* Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as global address space.
* Multiple processors can operate independently but share the same memory resources.
* Changes in a memory location effected by one processor are visible to all other processors.
* Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.

Page 37: Paralel Computing

Uniform Memory Access (UMA):

* Most commonly represented today by Symmetric Multiprocessor (SMP) machines
* Identical processors
* Equal access and access times to memory
* Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.

Page 38: Paralel Computing

UMA

Page 39: Paralel Computing

Non-Uniform Memory Access (NUMA):

* Often made by physically linking two or more SMPs
* One SMP can directly access memory of another SMP
* Not all processors have equal access time to all memories
* Memory access across the link is slower
* If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent NUMA

Page 40: Paralel Computing

NUMA

Page 41: Paralel Computing

Advantages & Disadvantages

Advantages:

* Global address space provides a user-friendly programming perspective to memory
* Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs

Disadvantages:

* Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory-CPU path, and for cache coherent systems, geometrically increase traffic associated with cache/memory management.
* Programmer responsibility for synchronization constructs that ensure "correct" access of global memory.
* Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.

Page 42: Paralel Computing

Distributed Memory

General Characteristics:

Like shared memory systems, distributed memory systems vary widely but share a common characteristic: distributed memory systems require a communication network to connect inter-processor memory.

Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of global address space across all processors.

Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply.

When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.

The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.

Page 43: Paralel Computing

Distributed Memory

Page 44: Paralel Computing

Distributed Memory

Advantages:

* Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately.
* Each processor can rapidly access its own memory without interference and without the overhead incurred in trying to maintain cache coherency.
* Cost effectiveness: can use commodity, off-the-shelf processors and networking.

Disadvantages:

* The programmer is responsible for many of the details associated with data communication between processors.
* It may be difficult to map existing data structures, based on global memory, to this memory organization.
* Non-uniform memory access (NUMA) times

Page 45: Paralel Computing

Hybrid Distributed-Shared Memory

The largest and fastest computers in the world today employ both shared and distributed memory architectures.

The shared memory component is usually a cache coherent SMP machine. Processors on a given SMP can address that machine's memory as global.

The distributed memory component is the networking of multiple SMPs. SMPs know only about their own memory - not the memory on another SMP. Therefore, network communications are required to move data from one SMP to another.

Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future.

Advantages and Disadvantages: whatever is common to both shared and distributed memory architectures.

Page 46: Paralel Computing

The end of the first lecture!!

QUESTIONS? Comments? Requests?

Page 47: Paralel Computing

Parallel Computing

Instructor: Nurbek Saparkhojayev
Lecture #2: Parallel Programming Models

Page 48: Paralel Computing

Models

There are several parallel programming models in common use:

o Shared Memory
o Threads
o Message Passing
o Data Parallel
o Hybrid

Parallel programming models exist as an abstraction above hardware and memory architectures.

Although it might not seem apparent, these models are NOT specific to a particular type of machine or memory architecture. In fact, any of these models can (theoretically) be implemented on any underlying hardware. Two examples:

Page 49: Paralel Computing

1st Model

1. Shared memory model on a distributed memory machine: Kendall Square Research (KSR) ALLCACHE approach.

Machine memory was physically distributed, but appeared to the user as a single shared memory (global address space). Generically, this approach is referred to as "virtual shared memory". Note: although KSR is no longer in business, there is no reason to suggest that a similar implementation will not be made available by another vendor in the future.

Page 50: Paralel Computing

2nd Model

2. Message passing model on a shared memory machine: MPI on SGI Origin.

The SGI Origin employed the CC-NUMA type of shared memory architecture, where every task has direct access to global memory. However,

the ability to send and receive messages with MPI, as is commonly done over a network of distributed memory machines, is not only implemented but is very

commonly used.

* Which model to use is often a combination of what is available and personal choice. There is no "best" model, although there certainly are better

implementations of some models over others.

* The following sections describe each of the models mentioned above, and also discuss some of their actual implementations.

Page 51: Paralel Computing

Shared Memory Model(detailed)

In the shared-memory programming model, tasks share a common address space, which they read and write asynchronously.

Various mechanisms such as locks / semaphores may be used to control access to the shared memory.

An advantage of this model from the programmer's point of view is that the notion of data "ownership" is lacking, so there is no need to specify explicitly the

communication of data between tasks. Program development can often be simplified.

An important disadvantage in terms of performance is that it becomes more difficult to understand and manage data locality.

Keeping data local to the processor that works on it conserves memory accesses, cache refreshes and bus traffic that occurs when multiple processors

use the same data. Unfortunately, controlling data locality is hard to understand and beyond

the control of the average user.

Page 52: Paralel Computing

Shared Memory Model(detailed)

Implementations:

On shared memory platforms, the native compilers translate user program variables into actual memory addresses, which are global.

No common distributed memory platform implementations currently exist. However, as mentioned previously in the Overview section, the KSR

ALLCACHE approach provided a shared memory view of data even though the physical memory of the machine was distributed.

Page 53: Paralel Computing

Threads Model

In the threads model of parallel programming, a single process can have multiple, concurrent execution paths.

Perhaps the simplest analogy that can be used to describe threads is the concept of a single program that includes a number of subroutines:

Page 54: Paralel Computing

Threads Model

Page 55: Paralel Computing

Threads Model(Code)

The main program a.out is scheduled to run by the native operating system. a.out loads and acquires all of the necessary system and user resources to run.

a.out performs some serial work, and then creates a number of tasks (threads) that can be scheduled and run by the operating system concurrently.

Each thread has local data, but also shares all of the resources of a.out. This saves the overhead associated with replicating a program's resources for each thread. Each thread also benefits from a global memory view because it shares the memory space of a.out.

A thread's work may best be described as a subroutine within the main program. Any thread can execute any subroutine at the same time as other

threads.
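Below is a minimal, hypothetical sketch of this model using POSIX Threads (introduced a few slides later); the worker routine, the thread count and the shared counter are illustrative only. Compile with something like gcc -pthread.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                       /* shared data in a.out's memory */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread runs this "subroutine" concurrently with the others. */
static void *worker(void *arg)
{
    long id = (long)arg;                       /* thread-local data */
    pthread_mutex_lock(&lock);                 /* synchronize access to shared memory */
    counter += id;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)                                 /* the "a.out" main program */
{
    pthread_t threads[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)        /* create threads after some serial work */
        pthread_create(&threads[i], NULL, worker, (void *)i);

    for (int i = 0; i < NTHREADS; i++)         /* wait for all threads, then continue serially */
        pthread_join(threads[i], NULL);

    printf("counter = %ld\n", counter);
    return 0;
}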

Page 56: Paralel Computing

Threads Model

Threads communicate with each other through global memory (updating address locations). This requires synchronization constructs to ensure that no more than one thread is updating the same global address at any time.

Threads can come and go, but a.out remains present to provide the necessary shared resources until the application has completed.

Threads are commonly associated with shared memory architectures and operating systems.

Page 57: Paralel Computing

Threads Implementations:

From a programming perspective, threads implementations commonly comprise:

* A library of subroutines that are called from within parallel source code
* A set of compiler directives embedded in either serial or parallel source code

In both cases, the programmer is responsible for determining all parallelism.

Threaded implementations are not new in computing. Historically, hardware vendors have implemented their own proprietary versions of threads. These implementations differed substantially from each other making it difficult for

programmers to develop portable threaded applications.

Page 58: Paralel Computing

Threads Implementations:

Unrelated standardization efforts have resulted in two very different implementations of threads: POSIX Threads and OpenMP.

# POSIX Threads
* Library based; requires parallel coding
* Specified by the IEEE POSIX 1003.1c standard (1995).
* C language only
* Commonly referred to as Pthreads.
* Most hardware vendors now offer Pthreads in addition to their proprietary threads implementations.
* Very explicit parallelism; requires significant programmer attention to detail.

# OpenMP
* Compiler directive based; can use serial code
* Jointly defined and endorsed by a group of major computer hardware and software vendors. The OpenMP Fortran API was released October 28, 1997. The C/C++ API was released in late 1998.
* Portable / multi-platform, including Unix and Windows NT platforms
* Available in C/C++ and Fortran implementations
* Can be very easy and simple to use - provides for "incremental parallelism"

# Microsoft has its own implementation for threads, which is not related to the UNIX POSIX standard or OpenMP.
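As a hedged illustration of the "incremental parallelism" style, here is a minimal OpenMP sketch (the loop and array are made up; it assumes an OpenMP-capable compiler, e.g. gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N];

    /* One compiler directive parallelizes the loop; the serial code is otherwise unchanged. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    printf("ran with up to %d threads, a[N-1] = %f\n",
           omp_get_max_threads(), a[N - 1]);
    return 0;
}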

Page 59: Paralel Computing

More Information:

POSIX Threads tutorial: computing.llnl.gov/tutorials/pthreads

OpenMP tutorial: computing.llnl.gov/tutorials/openMP

Page 60: Paralel Computing

Terminology

Performance: A quantifiable measure of the rate of doing (computational) work.

There are multiple such measures of performance:

* Delineated at the level of the basic operation:
  - ops - operations per second
  - ips - instructions per second
  - flops - floating-point operations per second
* Rate at which a benchmark program executes. A benchmark is a carefully crafted and controlled code used to compare systems:
  - Linpack Rmax (Linpack flops)
  - gups (billion updates per second)
  - others

Two perspectives on performance (see the sketch after this list):

* Peak performance - Maximum theoretical performance possible for a system
* Sustained performance - Observed performance for a particular workload and run; varies across workloads and possibly between runs
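As a rough, illustrative sketch (not from the slides), sustained performance for one simple kernel could be measured as below; the kernel, the array sizes and the use of POSIX clock_gettime are arbitrary choices:

#include <stdio.h>
#include <time.h>

#define N 1000000

static double x[N], y[N];

int main(void)
{
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    double sum = 0.0;
    for (int i = 0; i < N; i++)      /* one multiply + one add = 2 flops per iteration */
        sum += x[i] * y[i];

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    /* Sustained rate for this one kernel; peak would come from the hardware specification. */
    printf("sum = %f, sustained rate = %.1f MFLOPS\n", sum, 2.0 * N / secs / 1e6);
    return 0;
}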

Page 61: Paralel Computing

Scalability

The ability to deliver proportionally greater sustained performance through increased system resources.

* Strong Scaling - Fixed-size application problem. Application size remains constant with increase in system size (see the formulas after this list).
* Weak Scaling - Variable-size application problem. Application size scales proportionally with system size.
* Capability computing - In its most pure form: strong scaling. Marketing claims tend toward this class.
* Capacity computing - Throughput computing. Includes job-stream workloads. In its most simple form: weak scaling.
* Cooperative computing - Interacting and coordinating concurrent processes. Not a widely used term. Also: coordinated computing.
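In formula form (a standard way to state these definitions, not taken from the original slide), with T(W, p) denoting the wall-clock time to solve a problem of size W on p processors:

S_{\text{strong}}(p) = \frac{T(W, 1)}{T(W, p)}, \qquad
E_{\text{strong}}(p) = \frac{S_{\text{strong}}(p)}{p}
\quad \text{(fixed problem size } W\text{)}

E_{\text{weak}}(p) = \frac{T(W_1, 1)}{T(p \cdot W_1, p)}
\quad \text{(problem size grown in proportion to } p\text{)}

An efficiency close to 1 indicates good scaling in either sense.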

Page 62: Paralel Computing

The end of the first half of 2nd Lecture

Questions? Comments? Requests?

Page 63: Paralel Computing

Parallel Computing

Instructor: Nurbek Saparkhojayev
Lecture #2: Parallel Programming Models

Page 64: Paralel Computing

Message Passing Model

Page 65: Paralel Computing

Message Passing Model

The message passing model demonstrates the following characteristics:

# A set of tasks that use their own local memory during computation. Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines.

# Tasks exchange data through communications by sending and receiving messages.

# Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.
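A minimal, illustrative MPI sketch of such a matched send/receive pair (assumes an MPI installation; compile with mpicc and run with two tasks, e.g. mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                        /* sender/producer */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                 /* the matching receive on the consumer */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("task 1 received %d from task 0\n", value);
    }

    MPI_Finalize();
    return 0;
}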

Page 66: Paralel Computing

Implementations:

* From a programming perspective, message passing implementations commonly comprise a library of subroutines that are embedded in source code. The programmer is responsible for determining all parallelism.
* Historically, a variety of message passing libraries have been available since the 1980s. These implementations differed substantially from each other, making it difficult for programmers to develop portable applications.
* In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations.
* Part 1 of the Message Passing Interface (MPI) was released in 1994. Part 2 (MPI-2) was released in 1996. Both MPI specifications are available on the web at http://www-unix.mcs.anl.gov/mpi/.
* MPI is now the "de facto" industry standard for message passing, replacing virtually all other message passing implementations used for production work. Most, if not all, of the popular parallel computing platforms offer at least one implementation of MPI. A few offer a full implementation of MPI-2.
* For shared memory architectures, MPI implementations usually don't use a network for task communications. Instead, they use shared memory (memory copies) for performance reasons.

Page 67: Paralel Computing

More Info

MPI tutorial: computing.llnl.gov/tutorials/mpi

Page 68: Paralel Computing

Data Parallel Model

The data parallel model demonstrates the following characteristics:

o Most of the parallel work focuses on performing operations on a data set. The data set is typically organized into a common structure, such as an array or cube.
o A set of tasks work collectively on the same data structure; however, each task works on a different partition of the same data structure.
o Tasks perform the same operation on their partition of work, for example, "add 4 to every array element".

* On shared memory architectures, all tasks may have access to the data structure through global memory. On distributed memory architectures, the data structure is split up and resides as "chunks" in the local memory of each task.

Page 69: Paralel Computing

Data Parallel Model

Page 70: Paralel Computing

Implementations:

Programming with the data parallel model is usually accomplished by writing a program with data parallel constructs. The constructs can be calls to a data parallel subroutine library or compiler directives recognized by a data parallel compiler.

Fortran 90 and 95 (F90, F95): ISO/ANSI standard extensions to Fortran 77.

* Contains everything that is in Fortran 77
* New source code format; additions to character set
* Additions to program structure and commands
* Variable additions - methods and arguments
* Pointers and dynamic memory allocation added
* Array processing (arrays treated as objects) added
* Recursive and new intrinsic functions added
* Many other new features

Implementations are available for most common parallel platforms.

Page 71: Paralel Computing

HPF

# High Performance Fortran (HPF): Extensions to Fortran 90 to support data parallel programming.

* Contains everything in Fortran 90
* Directives to tell the compiler how to distribute data added
* Assertions that can improve optimization of generated code added
* Data parallel constructs added (now part of Fortran 95)

HPF compilers were common in the 1990s, but are no longer commonly implemented.

# Compiler Directives: Allow the programmer to specify the distribution and alignment of data. Fortran implementations are available for most common parallel platforms.

# Distributed memory implementations of this model usually have the compiler convert the program into standard code with calls to a message passing library (usually MPI) to distribute the data to all the processes. All message passing is done invisibly to the programmer.

Page 72: Paralel Computing

Other Models

Other parallel programming models besides those previously mentioned certainly exist, and will

continue to evolve along with the ever changing world of computer hardware and software. Only three of the more common ones are mentioned

here.

Page 73: Paralel Computing

Hybrid

# In this model, any two or more parallel programming models are combined.

# Currently, a common example of a hybrid model is the combination of the message passing model (MPI) with either the threads model (POSIX threads) or the shared memory model (OpenMP). This hybrid model lends itself well to

the increasingly common hardware environment of networked SMP machines.

# Another common example of a hybrid model is combining data parallel with message passing. As mentioned in the data parallel model section

previously, data parallel implementations (F90, HPF) on distributed memory architectures actually use message passing to transmit data between tasks,

transparently to the programmer.
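A minimal, illustrative sketch of the MPI + OpenMP combination (assumes both an MPI library and an OpenMP-capable compiler; the printed output is made up):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);                 /* message passing between SMP nodes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel                    /* threads within each SMP node */
    printf("MPI task %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}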

Page 74: Paralel Computing

Single Program Multiple Data (SPMD)

Page 75: Paralel Computing

SPMD

SPMD is actually a "high level" programming model that can be built upon any combination of the previously mentioned parallel programming models. # A

single program is executed by all tasks simultaneously.

# At any moment in time, tasks can be executing the same or different instructions within the same program.

# SPMD programs usually have the necessary logic programmed into them to allow different tasks to branch or conditionally execute only those parts of the program they are designed to execute. That is, tasks do not necessarily have

to execute the entire program - perhaps only a portion of it.

# All tasks may use different data

Page 76: Paralel Computing

Multiple Program Multiple Data (MPMD)

Page 77: Paralel Computing

MPMD

Like SPMD, MPMD is actually a "high level" programming model that can be built upon any combination of the previously mentioned parallel programming

models. MPMD applications typically have multiple executable object files (programs).

While the application is being run in parallel, each task can be executing the same or different program as other tasks.

# All tasks may use different data

Page 78: Paralel Computing

Parallel Computing

Instructor: Nurbek Saparkhojayev

Lecture #3: Designing Parallel Programs

Page 79: Paralel Computing

Automatic vs. Manual Parallelization

Designing and developing parallel programs has characteristically been a very manual process. The programmer is typically responsible for both identifying and actually implementing parallelism.

Very often, manually developing parallel codes is a time-consuming, complex, error-prone and iterative process.

For a number of years now, various tools have been available to assist the programmer with converting serial programs into parallel programs. The most common type of tool used to automatically parallelize a serial program is a parallelizing compiler or pre-processor.

A parallelizing compiler generally works in two different ways:

1. Fully Automatic

The compiler analyzes the source code and identifies opportunities for parallelism. The analysis includes identifying inhibitors to parallelism and possibly a cost weighting on whether or not the parallelism would actually improve performance.

Loops (do, for) are the most frequent target for automatic parallelization.

Page 80: Paralel Computing

Automatic vs. Manual Parallelization

2. Programmer Directed

Using "compiler directives" or possibly compiler flags, the programmer explicitly tells the compiler how to parallelize the code. This may be used in conjunction with some degree of automatic parallelization as well.

If you are beginning with an existing serial code and have time or budget constraints, then automatic parallelization may be the answer. However, there are several important caveats that apply to automatic parallelization:

* Wrong results may be produced
* Performance may actually degrade
* Much less flexible than manual parallelization
* Limited to a subset (mostly loops) of code
* May actually not parallelize code if the analysis suggests there are inhibitors or the code is too complex

The remainder of this section applies to the manual method of developing parallel codes.

Page 81: Paralel Computing

Understand the Problem and the Program

1. Understand the problem you are trying to solve.

2. Think about the option of parallelizing this problem. Can this problem be parallelized or not?

Example of a Parallelizable Problem:

Calculate the potential energy for each of several thousand independent conformations of a molecule. When done, find the minimum energy conformation.

This problem can be solved in parallel. Each of the molecular conformations is independently determinable. The calculation of the minimum energy conformation is also a parallelizable problem.

# Example of a Non-parallelizable Problem:

Calculation of the Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, ...) by use of the formula:

F(n) = F(n-1) + F(n-2)

This is a non-parallelizable problem because the calculation of the Fibonacci sequence as shown would entail dependent calculations rather than independent ones. The calculation of the F(n) value uses those of both F(n-1) and F(n-2). These three terms cannot be calculated independently and therefore not in parallel.

Page 82: Paralel Computing

Understand the Problem and the Program

3. Identify the program's hotspots:

Know where most of the real work is being done. The majority of scientific and technical programs usually accomplish most of their work in a few places. Profilers and performance analysis tools can help here. Focus on parallelizing the hotspots and ignore those sections of the program that account for little CPU usage.

4. Identify bottlenecks in the program:

Are there areas that are disproportionately slow, or that cause parallelizable work to halt or be deferred? For example, I/O is usually something that slows a program down. It may be possible to restructure the program or use a different algorithm to reduce or eliminate unnecessary slow areas.

5. Identify inhibitors to parallelism. One common class of inhibitor is data dependence, as demonstrated by the Fibonacci sequence above.

6. Investigate other algorithms if possible. This may be the single most important consideration when designing a parallel application.

Page 83: Paralel Computing

Partitioning

One of the first steps in designing a parallel program is to break the problem into discrete "chunks" of work that can be distributed to multiple tasks. This is known as decomposition or partitioning.

There are two basic ways to partition computational work among parallel tasks: domain decomposition and functional decomposition. However, combining these two types of problem decomposition is common and natural.

Page 84: Paralel Computing

a. Domain Decomposition

In this type of partitioning, the data associated with a problem is decomposed. Each parallel task then works on a portion of the data, as sketched below.
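One common way to compute each task's portion under a block (domain) decomposition is sketched here; the helper and its names are illustrative, not from the slides:

#include <stdio.h>

/* Block-distribute n elements over ntasks tasks: task `id` gets indices [lo, hi).
 * Any remainder elements are given, one each, to the lowest-numbered tasks.      */
static void block_range(long n, int ntasks, int id, long *lo, long *hi)
{
    long base = n / ntasks;
    long rem  = n % ntasks;
    *lo = id * base + (id < rem ? id : rem);
    *hi = *lo + base + (id < rem ? 1 : 0);
}

int main(void)
{
    long lo, hi;
    for (int id = 0; id < 3; id++) {            /* e.g. 10 elements over 3 tasks */
        block_range(10, 3, id, &lo, &hi);
        printf("task %d: indices [%ld, %ld)\n", id, lo, hi);
    }
    return 0;
}

In an MPI or threaded code, id and ntasks would come from the task rank or thread number, and each task would then loop only over its own chunk, for (i = lo; i < hi; i++).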

Page 85: Paralel Computing

a. Domain Decomposition

There are different ways to partition data:

Page 86: Paralel Computing

b. Functional Decomposition

In this approach, the focus is on the computation that is to be performed rather than on the data

manipulated by the computation. The problem is decomposed according to the work that must be

done. Each task then performs a portion of the overall work.

Page 87: Paralel Computing

b. Functional Decomposition

Functional decomposition lends itself well to problems that can be split into different tasks. For

example:

1. Ecosystem Modeling

Each program calculates the population of a given group, where each group's growth depends on

that of its neighbors. As time progresses, each process calculates its current state, then exchanges

information with the neighbor populations. All tasks then progress to calculate the state at the next

time step.

Page 88: Paralel Computing

b. Functional Decomposition2. Signal Processing:

An audio signal data set is passed through four distinct computational filters. Each filter is a separate

process. The first segment of data must pass through the first filter before progressing to the second.

When it does, the second segment of data passes through the first filter. By the time the fourth

segment of data is in the first filter, all four tasks are busy.

Page 89: Paralel Computing

b. Functional Decomposition3. Climate Modeling

Each model component can be thought of as a separate task. Arrows represent exchanges of data

between components during computation: the atmosphere model generates wind velocity data that

are used by the ocean model, the ocean model generates sea surface temperature data that are

used by the atmosphere model, and so on.

Page 90: Paralel Computing

Communications

Who Needs Communications?

The need for communications between tasks depends upon your problem:

You DON'T need communications:

- Some types of problems can be decomposed and executed in parallel with virtually no need

for tasks to share data. For example, imagine an image processing operation where every pixel in a

black and white image needs to have its color reversed. The image data can easily be distributed to

multiple tasks that then act independently of each other to do their portion of the work.

- These types of problems are often called embarrassingly parallel because they are so

straight-forward. Very little inter-task communication is required.

You DO need communications

- Most parallel applications are not quite so simple, and do require tasks to share data with each other. For example, a 3-D heat diffusion problem requires a task to know the temperatures calculated by the tasks that have neighboring data. Changes to neighboring data have a direct effect on that task's data.

Page 91: Paralel Computing

Factors to Consider:

There are a number of important factors to consider when designing your program's inter-task

communications:

Cost of communications

- Inter-task communication virtually always implies overhead.

- Machine cycles and resources that could be used for computation are instead

used to package and transmit data.

- Communications frequently require some type of synchronization between

tasks, which can result in tasks spending time "waiting" instead of doing work.

- Competing communication traffic can saturate the available network

bandwidth, further aggravating performance problems.

Latency vs. Bandwidth

Latency is the time it takes to send a minimal (0 byte) message from point A to point B. Commonly expressed as microseconds.

Bandwidth is the amount of data that can be communicated per unit of time. Commonly expressed as megabytes/sec or gigabytes/sec.

Sending many small messages can cause latency to dominate communication overheads. Often it is more efficient to package small messages into a larger message, thus increasing the effective communications bandwidth (see the sketch below).
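A rough way to see this is the simple cost model time = latency + size / bandwidth. The sketch below (illustrative numbers only, not from the slides) compares many small messages with one large message carrying the same data:

#include <stdio.h>

/* Simple communication-cost model: t = latency + size / bandwidth. */
static double msg_time(double latency_s, double bandwidth_Bps, double bytes)
{
    return latency_s + bytes / bandwidth_Bps;
}

int main(void)
{
    /* Illustrative numbers: 2 microseconds latency, 10 GB/s bandwidth. */
    double lat = 2e-6, bw = 10e9;

    /* 1000 messages of 1 KB vs. one 1 MB message carrying the same data. */
    printf("1000 x 1 KB : %.1f us\n", 1e6 * 1000 * msg_time(lat, bw, 1e3));
    printf("1    x 1 MB : %.1f us\n", 1e6 * msg_time(lat, bw, 1e6));
    return 0;
}

With these assumed numbers the batched message is roughly twenty times cheaper, because the fixed latency is paid only once.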

Page 92: Paralel Computing

Factors to consider

Visibility of communications

With the Message Passing Model, communications are explicit and generally quite visible and under the control of the programmer.

With the Data Parallel Model, communications often occur transparently to the programmer, particularly on distributed memory architectures. The programmer may not even be able to know exactly how inter-task communications are being accomplished.

Synchronous vs. asynchronous communications

Synchronous communications require some type of "handshaking" between tasks that are sharing

data. This can be explicitly structured in code by the programmer, or it may happen at a lower level

unknown to the programmer.

Synchronous communications are often referred to as blocking communications since other work

must wait until the communications have completed.

Asynchronous communications allow tasks to transfer data independently from one another. For

example, task 1 can prepare and send a message to task 2, and then immediately begin doing other

work. When task 2 actually receives the data doesn't matter.

Asynchronous communications are often referred to as non-blocking communications since other

work can be done while the communications are taking place.

Interleaving computation with communication is the single greatest benefit for using asynchronous

communications.

Page 93: Paralel Computing

Factors to consider

Scope of communications

Knowing which tasks must communicate with each other is critical during the design stage of a

parallel code. Both of the two scopings described below can be implemented synchronously or

asynchronously.

Point-to-point - involves two tasks with one task acting as the sender/producer of data, and the

other acting as the receiver/consumer.

Collective - involves data sharing between more than two tasks, which are often specified as

being members in a common group, or collective. Some common variations (there are more):

Page 94: Paralel Computing

Factors to consider

Page 95: Paralel Computing

Factors to consider

Efficiency of communications

Very often, the programmer will have a choice with regard to factors that can affect

communications performance. Only a few are mentioned here.

Which implementation for a given model should be used? Using the Message Passing Model as

an example, one MPI implementation may be faster on a given hardware platform than another.

What type of communication operations should be used? As mentioned previously, asynchronous

communication operations can improve overall program performance.

Network media - some platforms may offer more than one network for communications. Which

one is best?

Page 96: Paralel Computing

Overhead and Complexity

Page 97: Paralel Computing

Synchronization

Types of Synchronization:

1. Barrier- Usually implies that all tasks are involved

Each task performs its work until it reaches the barrier. It then stops, or "blocks".

When the last task reaches the barrier, all tasks are synchronized.

What happens from here varies. Often, a serial section of work must be done. In other cases,

the tasks are automatically released to continue their work.

2. Lock / semaphore - Can involve any number of tasks

Typically used to serialize (protect) access to global data or a section of code. Only one task

at a time may use (own) the lock / semaphore / flag.

The first task to acquire the lock "sets" it. This task can then safely (serially) access the

protected data or code.

Other tasks can attempt to acquire the lock but must wait until the task that owns the lock

releases it. Can be blocking or non-blocking

3. Synchronous communication operations

Involves only those tasks executing a communication operation

When a task performs a communication operation, some form of coordination is required with

the other task(s) participating in the communication. For example, before a task can perform a send

operation, it must first receive an acknowledgment from the receiving task that it is OK to send.

Discussed previously in the Communications section
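As a minimal, illustrative barrier sketch using MPI (assumes an MPI installation; the printed messages are made up):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("task %d: doing its own work\n", rank);

    MPI_Barrier(MPI_COMM_WORLD);   /* every task blocks here until the last one arrives */

    if (rank == 0)                 /* a serial section often follows the barrier */
        printf("all tasks synchronized\n");

    MPI_Finalize();
    return 0;
}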

Page 98: Paralel Computing

The end of the lecture

Questions? Comments?

Page 99: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS

PARALLEL COMPUTER ARCHITECTURE

Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
January 20, 2011

Page 100: Paralel Computing


Topics

• Introduction
• Review Performance Factors (good & bad)
• What is Computer Architecture
• Parallel Structures & Performance Issues
• Performance Metrics
• Coarse-grained MIMD Processing – MPPs
• Very Fine-grained Vector Processing and PVPs
• SIMD array and SPMD
• Special Purpose Devices and Systolic Structures
• An Introduction to Shared Memory Multiprocessors
• Current generation multicore and heterogeneous architectures
• Summary – Material for the Test


Page 102: Paralel Computing


Opening Remarks

• This lecture is an introduction to supercomputer architecture
  – Major parameters, classes, and system level
• Architecture exploits device technology to deliver its innate computation performance potential
  – Structures and system organization
  – Semantics of operation and memory (instruction set architecture, ISA)
• Between device technology and architecture is circuit design
  – Circuit design converts devices to logic gates and higher level logical structures (e.g. multiplexers, adders)
  – but this is outside the scope of this course.
• We will assume a basic logic abstraction with characterizing properties:
  – Functional behavior (the logical operation it performs)
  – Switching speed
  – Propagation delay or latency
  – Size and power

Page 103: Paralel Computing


HPC System Stack


Science Problems: Environmental Modeling, Physics, Computational Chemistry, etc.
Application: Coastal Modeling, Black hole simulations, etc.
Algorithms: PDE, Gaussian Elimination, 12 Dwarves, etc.
Program Source Code
Programming Languages: Fortran, C, C++, UPC, Fortress, X10, etc.
Compilers: Intel C/C++/Fortran Compilers, PGI C/C++/Fortran, IBM XLC, XLC++, XLF, etc.
Runtime Systems: Java Runtime, MPI, etc.
Operating Systems: Linux, Unix, AIX, etc.
Systems Architecture: Vector, SIMD array, MPP, Commodity Cluster
Firmware: Motherboard chipset, BIOS, NIC drivers
Microarchitectures: Intel/AMD x86, SUN SPARC, IBM Power 5/6
Logic Design: RTL
Circuit Design: ASIC, FPGA, Custom VLSI
Device Technology: NMOS, CMOS, TTL, Optical

(A rotated label in the original diagram reads "Model of Computation".)


Page 105: Paralel Computing


Performance Factors: Technology Speed

• Latencies
  – Logic latency time
  – Processor to memory access latency
  – Memory access time
  – Network latency
• Cycle Times
  – Logic switching speed
  – On-chip clock speed (clock cycle time)
  – Memory cycle time
• Throughput
  – On-chip data transfer rate
  – Instructions per cycle
  – Network data rate
• Granularity
  – Logic density
  – Memory density
  – Task size
  – Packet size

Page 106: Paralel Computing


Machine Parameters affecting Performance

• Peak floating point performance
• Main memory capacity
• Bi-section bandwidth
• I/O bandwidth
• Secondary storage capacity
• Organization
  – Class of system
  – # nodes
  – # processors per node
  – Accelerators
  – Network topology
• Control strategy
  – MIMD
  – Vector, PVP
  – SIMD
  – SPMD

Page 107: Paralel Computing


Performance Factors: Parallelism

• Fully independent processing elements operating concurrently on separate tasks
  – Coarse grained
  – Communicating Sequential Processes (CSP)
  – Single Program Multiple Data stream (SPMD)
• Instruction Level Parallelism (ILP)
  – Fine grained
  – Single instruction performs multiple operations
• Pipelining
  – Fine grained
  – Overlapping sequential operations in execution pipeline
  – Vector pipelines
• SIMD operations
  – Fine / Medium grained
  – Single Instruction stream, Multiple Data stream
  – ALU arrays
• Overlapping of computation and communication
  – Fine / Medium grained
  – Asynchronous
  – Prefetching
• Multithreading
  – Medium grained
  – Separate instruction streams serve a single processor

Page 108: Paralel Computing


Sources of Performance Degradation (SLOW)

• Starvation
  – Not enough work to do among distributed resources
  – Insufficient parallelism
  – Inadequate load balancing
  – e.g.: Amdahl's law (see the formula after this list)
• Latency
  – Time required for response of access to remote data or services
  – Waiting for access to memory or other parts of the system
  – e.g.: Local memory access, network communication
• Overhead
  – Extra work that has to be done to manage program concurrency and parallel resources beyond the real work you want to perform
  – Critical-path work for management of concurrent tasks and parallel resources not required for sequential execution
  – e.g.: Synchronization and scheduling
• Waiting for Contention
  – Delays due to conflicts for use of shared resources
  – e.g.: Memory bank conflicts, shared network channels
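For reference, the standard statement of Amdahl's law (added here; it appears on the slide only by name): if a fraction f of the work is inherently sequential and the rest parallelizes perfectly over p processors, then

S(p) = \frac{1}{\,f + \dfrac{1 - f}{p}\,} \;\le\; \frac{1}{f}

For example, with f = 0.1 the speedup can never exceed 10, no matter how many processors are added.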


Page 110: Paralel Computing


Computer Architecture

• Structure
  – Functional elements
  – Organization and balance
  – Interconnect and data flow paths
• Semantics
  – Meaning of the logical constructs
  – Primitive data types
  – Manifest as the Instruction Set Architecture abstract layer
• Mechanisms
  – Primitive functions that are usually implemented in hardware or sometimes firmware
  – Determines preferred actions and sequences
  – Enables efficiency and scalability
• Policy
  – Approach and priorities to accomplishing a goal
  – e.g., cache replacement policy

Page 111: Paralel Computing


Structure

• Functional elements
  – The form of functional elements made up of more primitive logical modules
  – e.g. a vector arithmetic unit comprising a pipeline of simple stages
• Organization and balance
  – Number of major elements of different types
  – Hierarchy of collections of elements
• Data flow
  – Interconnection of functional, state, and communication elements
  – Control of dataflow paths determines actions of processor and system

Page 112: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

14

Semantics

• Meaning of the logical constructs– Basic operations that can be performed on data

• Primitive data types– What collections of bits (e.g. word) means– Defines actions that can be performed on binary strings

• Instruction Set Architecture– Defined set of actions that can be performed and data object on

which they can be applied– Encoding of binary strings to represent distinct instructions

• Parallel control constructs– Hardware implemented : vector operations, – Software implemented : MPI libraries

Page 113: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

15

Mechanisms• Primitive functions that are usually implemented in

hardware or sometimes firmware– Lower level than instruction set operations– Multiple such mechanisms contribute to execution of

operation

• Determines preferred actions and sequences– Usually time effective primitives– Usually widely used by many instructions

• Enables efficiency and scalability– Establishes basic performance properties of machine

• Examples– Basic arithmetic and logic unit functions– Thread context switching– TLB (Translation Lookaside Buffer) address translation– Cache line replacement– Branch prediction

Page 114: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

16

Policy• Hardware architecture policies

– Decision of ordering or allocation dependent on criteria– Not all machine decisions are visible to the ISA of the system– Not all machine choices are available to the name space of the

operands– Examples

• Cache structure, size, and speed • Cache replacement policies• Order of operation execution• Branch prediction• Allocation of shared resources• Network routers

• Software system management policies– Scheduling,– Data allocation : partitioning of a problem

Page 115: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

17

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 116: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

18

Parallel Structures & Performance Issues

• Pipelining– Vector processing– Execution pipeline– Performance Issues:

• Pipelining increases throughput : More operations per unit time• Pipelining increases latency time : Operation on single operand pair

can take longer than non-pipelined functional unit

• Multiple Arithmetic Units– Instruction level parallelism– Systolic arrays– Performance Issues:

• Increases peak performance• Requires application instruction level parallelism• Average usually significantly lower than peak

Page 117: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

19

Parallel Structures & Performance Issues

• Multiple processors– MIMD: Separate control– SIMD: Single controller– Multicore– Accelerators

• Performance Issues: Multiple processors require overhead operations– Synchronization– Communications – Possibly cache Coherence

Page 118: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

20

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 119: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

21

Scalability• The ability to deliver proportionally greater sustained performance through

increased system resources• Strong Scaling

– Fixed size application problem– Application size remains constant with increase in system size

• Weak Scaling– Variable size application problem– Application size scales proportionally with system size (both regimes are sketched as formulas after this list)

• Capability computing– in most pure form: strong scaling– Marketing claims tend toward this class

• Capacity computing– Throughput computing– Includes job-stream workloads

– In most simple form: weak scaling

• Cooperative computing– Interacting and coordinating concurrent processes– Not a widely used term– Also: “coordinated computing”
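Both regimes are often summarized with two standard formulas (a supplementary sketch using the conventional definitions, not equations from the slide): with parallel fraction $f$ and $n$ processors,

$S_{strong}(n) = \dfrac{1}{(1-f) + f/n}$ (Amdahl), $\qquad$ $S_{weak}(n) = (1-f) + f\,n$ (Gustafson)

so strong scaling saturates at $1/(1-f)$, while weak scaling can continue to grow as long as the per-processor problem size is held constant.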

Page 120: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

22

Performance Metrics

• Peak floating point operations per second (flops)• Peak instructions per second (ips)• Sustained throughput

– Average performance over a period of time– flops, Mflops, Gflops, Tflops, Pflops – flops, Megaflops, Gigaflops, Teraflops, Petaflops– ips, Mips, ops, Mops …

• Cycles per instruction– cpi – Alternatively: instructions per cycle, ipc

• Memory access latency– measured in processor cycles (or nanoseconds); see the relation sketched after this list

• Memory access bandwidth– bytes per second (Bps)– bits per second (bps)– or Gigabytes per second, GBps, GB/s

• Bi-section bandwidth– bytes per second
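These metrics are tied together by a standard relation (added here for reference; it is not spelled out on the slide): for a program executing $N_{instr}$ instructions on a processor with clock frequency $f_{clk}$,

$T_{exec} = \dfrac{N_{instr} \times CPI}{f_{clk}}$, $\qquad$ sustained flops $= \dfrac{\text{floating point operations completed}}{T_{exec}}$

so sustained throughput reflects the achieved CPI (or IPC) and the memory behaviour of the code, whereas peak flops reflects only the hardware's best case.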

Page 121: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

23

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 122: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Basic Uni-processor Architecture elements

• I/O Interface

• Memory Interface

• Cache hierarchy

• Register Sets

• Control

• Execution pipeline

• Arithmetic Logic Units

24

Page 123: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

25

Multiprocessor• A general class of system• Integrates multiple processors in to an interconnected ensemble• MIMD: Multiple Instruction Stream Multiple Data Stream• Different memory models

– Distributed memory• Nodes support separate address spaces

– Shared memory• Symmetric multiprocessor• UMA – uniform memory access• Cache coherent

– Distributed shared memory• NUMA – non uniform memory access• Cache coherent

– PGAS• Partitioned global address space• NUMA• Not cache coherent

– Hybrid : Ensemble of distributed shared memory nodes• Massively Parallel Processor, MPP

Page 124: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

26

Massively Parallel Processor

• MPP• General class of large scale multiprocessor• Represents largest systems

– IBM BG/L– Cray XT3

• Distinguished by memory strategy– Distributed memory– Distributed shared memory

• Cache coherent• Partitioned global address space

• Custom interconnect network• Potentially heterogeneous

– May incorporate accelerator to boost peak performance

Page 125: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

DM - MPP

27

Page 126: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

28

IBM Blue Gene/L

Page 127: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Historical Top-500 List

29

Page 128: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

30

BG/L packaging hierarchy

Page 129: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

ASCI RED

Compute Nodes 4,536

Service Nodes 32

Disk I/O Nodes 32

System Nodes (Boot) 2

Network Nodes (Ethernet, ATM) 10

System Footprint 1,600 Square Feet

Number of Cabinets 85

System RAM 594 Gbytes

Topology 38x32x2

Node to Node bandwidth - Bi-directional 800 Mbytes/sec

Bi-directional - Cross section Bandwidth 51.6 Gbytes/sec

Total number of Pentium® Pro Processors 9,216

Processor to Memory Bandwidth 533 Mbytes/sec

Compute Node Peak Performance 400 MFLOPS

System Peak Performance 1.8 TFLOPS

RAID I/O Bandwidth (per subsystem) 1.0 Gbytes/sec

RAID Storage (per subsystem) 1 Tbyte

31

Page 130: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

ASCI RED : I/O Board

32

Page 131: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

33

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 132: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

34

Pipeline Structures• Partitioning of functional unit into a sequence of stages

– Execution time of each stage is < that of the original unit– Total time through sequence of stages is usually > that of the

original unit• Pipeline permits overlapping of multiple operations

– At any one time: each stage is performing different operation– # of operations being performed in parallel = # stages

• Performance– Pipeline increments at clock rate of slowest pipeline stage– Response time for an operation is product of # stages and clock

cycle time– Throughput = clock rate

• i.e. one operation result per clock cycle of pipeline

• Pipeline structures employed in many parts of a computer architecture – to enable high throughput in the presence of high latency – enable faster clock rates

Page 133: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

35

Pipeline : Concepts

$Perf_c = \dfrac{1}{T_c}$

$Perf_p = \dfrac{1}{t_p}$

$T_p = N \times t_p$

$t_p < T_c < T_p$, hence $Perf_p > Perf_c$

Where :

• $T_c$ is the Logic Latency (of the non-pipelined unit)

• $T_p$ is the aggregated pipeline latency

• $t_p$ is the latency for each pipelined step

• $N$ is the number of pipeline stages
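As a quick worked check of these relations (the numbers are chosen for illustration only): if a functional unit with logic latency $T_c = 10$ ns is split into $N = 5$ stages of $t_p = 2.5$ ns each (stages rarely divide the logic perfectly, so $t_p > T_c/N$), then the pipeline latency is $T_p = N \times t_p = 12.5$ ns $> T_c$, but once the pipeline is full the throughput rises from $Perf_c = 1/T_c = 100$ Mops/s to $Perf_p = 1/t_p = 400$ Mops/s.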

Page 134: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

36

Vector Processors• Supports fine grained data parallel semantics

– Many instances of same operation performed concurrently under same control element

– Operates on vector data structures rather than single scalar values– Vector-scalar operations

• Scale a vector by a scalar factor (multiply each vector element by scalar)

– Inter-vector operations• e.g., Pair wise multiplies

– Intra-vector operation• Reduction operators• e.g., sum all elements of a vector

• Exploits pipeline structure– Arithmetic units– Vector registers– Overlap of memory banks access cycles– Overlap of communication with computation

• Limited scaling – upper bound on number of pipeline stages
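The kind of loop a vector pipeline targets can be written, for example, as a simple vector triad in C (an illustrative sketch, not code from the lecture); every iteration is independent, so the elements can stream through the vector pipeline with one result per clock once the pipeline is full:

#include <stddef.h>

/* y[i] = a * x[i] + y[i] : one multiply-add per element with no
   loop-carried dependence, so the loop maps directly onto a vector
   (or SIMD) pipeline. */
void vector_triad(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}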

Page 135: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

37

Vector Pipeline Architecture

[Diagram] A vector register of $N_R$ elements feeds a vector ALU of $N_S$ stages; memory banks $M_1, M_2, \ldots, M_N$ sit on a high speed memory bus ($t_a$ : time for memory access).

$T_{MV} = t_s + \sum t_a$, $\qquad P_M = \dfrac{N}{T_{MV}}$

Where :

• $t_a$ is the time for memory access

• $t_s$ is the startup time

• $T_{MV}$ is the combined time for the memory-to-vector-register transfer

• $P_M$ is the memory performance

• $t_c$ is the ALU clock time of each step

Ideal vector performance : $\dfrac{1}{t_c}$

Achieved vector performance : $Perf_R = \dfrac{N_R}{(N_S + N_R) \times t_c}$

With $N_S := N_R$ : $Perf_R = \dfrac{N_R}{2 N_R \times t_c} = \dfrac{1}{2\,t_c}$

Page 136: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

38

Cray 1

[Figures: the Cray 1 system; Cray 1 logic boards]

• First announced in 1975-6 • 80 MHz Clock rate• Theoretical peak performance (160

MIPS), average performance 136 megaflops, vector optimized peak performance 150 megaflops

• 1-million 64 bit words of high speed memory

• Manufactured by Cray Research Inc.• First Customer was National Center for

Atmospheric Research (NCAR) for 8.86 million dollars.

src : http://en.wikipedia.org/wiki/Cray-1

Page 137: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

39


Page 138: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

40

Parallel-Vector-Processors: PVP

• Combines strengths of vector and MPP– Efficiency of vector processing

• Capability computing

– Scalability of massively parallel processing• Capacity and cooperative computing

• Two levels of parallelism– Ultra fine grain vector parallelism with vector pipelining– Medium to coarse-grain processor

• Memory model– Alternative ways of organizing memory & address space– Distributed memory

• Shared memory within node of multiple vector processors• Fragmented or decoupled address space between nodes

– Partitioned global address space• Globally accessible address space• No cache coherence between nodes

Page 139: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

PVP (e.g. Cray X-MP)

41

Page 140: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

42

Earth Simulator

src : http://www.es.jamstec.go.jp/esc/eng/

Page 141: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

43

Earth Simulator (Facts)

• Located in Yokohama, Japan• Size of the entire center about 4 tennis courts• Can execute 35.86 trillion (35,860,000,000,000) FLOPS,

or 35.86 TFLOPS (LINPACK)• Consists of 640 nodes with each node consisting of 8

vector processors and 16 GB of memory• Totaling 5120 processors and 10 Terabytes of memory• Aggregated disk storage of 700 Terabytes and around

1.6 Petabytes of storage in tape drives • Costs about 350 million dollars• First on the Top500 list for 5 consecutive times.

Surpassed by IBM's BlueGene/L prototype on September 24, 2004

Page 142: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011 44

PVP Examples

• Early machines– CRI XMP, YMP, C-90, T-90– Cray 2– Fujitsu VP5000

• SX-8

• Cray X1

Steve Scott

Cray Inc.

Page 143: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

45

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 144: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

46

SIMD Array• SIMD semantics

– Single Instruction stream, Multiple Data stream– Data set partitioned into blocks upon which the same operations are applied

• One or two dimensions (vectors or matrices)

– Each data block is processed separately– Each data block is controlled by same instruction sequence

– Data exchange cycle

• SIMD Parallel Structure– Node Array of arithmetic units, each coupled to local memory– Interconnect network for global data exchange– Single controller to issue instructions to array nodes

• Early systems broadcast one instruction at a time• Modern systems point to sequence of cached instructions

• SPMD– Single Program Multiple Data Stream– Microprocessor based system where each node runs same program
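A minimal SPMD sketch in C with MPI (an illustrative example using only standard MPI calls, not code from the lecture): every node runs the same program, and each process uses its rank to select its own share of the data.

#include <mpi.h>
#include <stdio.h>

/* Single Program Multiple Data: identical executable everywhere,
   the rank decides which block of work this process owns. */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1000000;          /* total elements (illustrative) */
    int chunk = n / size;           /* this process's share */
    int first = rank * chunk;

    double local_sum = 0.0;
    for (int i = first; i < first + chunk; i++)
        local_sum += (double)i;     /* stand-in for real work */

    printf("rank %d of %d handled elements %d..%d (local sum %g)\n",
           rank, size, first, first + chunk - 1, local_sum);

    MPI_Finalize();
    return 0;
}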

Page 145: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

SIMD

47

Page 146: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

48

[Diagram] Simplified SIMD Diagram: a control processor (sequencer) issues instructions over an instruction broadcast bus to an array of processing elements (data processors $MD_{00} \ldots MD_{nn}$), which exchange data through a switch.

Page 147: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

49

CM-2

CM-2 General Specifications :• Processors 65,536• Memory 512 Mbytes• Memory Bandwidth 300 Gbits/Sec• I/O Channels 8• Capacity per Channel 40 Mbytes/Sec• Max. Transfer Rate 320 Mbytes/Sec• Performance in excess of 2500 MIPS• Floating Point performance in excess of 2.5 GFlops

DataVault Specifications :• Storage Capacity 5 or 10 Gbytes• I/O Interfaces 2• Transfer Rate, Burst 40 Mbytes/Sec• Max. Aggregate Rate 320 Mbytes/Sec

• Originated at MIT, by Danny Hillis• Commercialized at Thinking Machines Corp. src : http://www.svisions.com/sv/cm-dv.html

Page 148: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

50

ClearSpeed SIMD Accelerator

• 1997 Intel ASCI Red Supercomputer• 1TFLOPS, 2,500 sq.

ft., 800KW, $55Million

• 2007 ClearSpeed + Intel Dense Cluster• 1 TFLOPS, 25 sq. ft.,

<7 KW, <$200K

• Medium-Coarse grained SIMD• 130nm fabrication technology• 250 MHz clock rate• 100 Gflops peak, 66 Gflops sustained

Page 149: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Tsubame

• Heterogeneous computing : Added ClearSpeed Boards

• 648 nodes resulting in 38.5 TFLOPS

• 648 nodes with 360 ClearSpeed boards to 47.38 TFLOPS

51

Page 150: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

52

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 151: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011 53

Special Purpose Devices • SPD• Optimized for a given algorithm or class of

problems• Functional elements and dataflow path mirror

the requirements of a specific algorithm• Usually exploits fine grain parallelism for very

high parallelism• Best for arithmetic (or logic) intensive

applications with limited memory access requirements

• Best for strong temporal and spatial locality• Systolic Arrays are one class of such

machines widely used in digital signal processing

• Examples– MD-Grape first Petaflops machine, for N-body

problem– GPU Graphics Processing Unit, e.g. NVIDIA– FPGA field programmable gate array

• Allows reconfiguration of logic array

Page 152: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011 54

Systolic Arrays

[Diagram] Example implementation: the Warp architecture. A host connects through an interface unit (address, X and Y data paths) to a linear Warp processor array of cells (Cell 1, Cell 2, Cell 3, …, Cell n); A and B operands stream through the processing elements while the C results accumulate.

Matrix multiplication on a Systolic Array: each processing element accumulates

$c_{ij} = \displaystyle\sum_{k=1}^{n} a_{ik}\, b_{kj}$

References:
M. Annaratone, E. Arnould, et al., “The Warp Computer: Architecture, Implementation, and Performance”
Y. Yang, W. Zhao, and Y. Inoue, “High-Performance Systolic Arrays for Band Matrix Multiplication”
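For reference, the term each cell accumulates is just the ordinary matrix-product sum above; a plain C rendering (an illustrative sketch, not the Warp implementation) is:

/* C = A * B for n x n row-major matrices: c[i][j] = sum_k a[i][k] * b[k][j].
   A systolic array streams the a and b operands through its cells so that
   each cell performs one multiply-accumulate per beat. */
void matmul(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}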

Page 153: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

55

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 154: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

56

Introduction to SMP

• Symmetric Multiprocessor

• Building block for large MPP• Multiple processors

– 2 to 32 processors– Now Multicore

• Uniform Memory Access (UMA) shared memory– Every processor has equal access in equal time to all banks of

the main memory

• Cache coherent– Multiple copies of variable maintained consistent by hardware
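On such a cache-coherent UMA node the usual programming style is multithreading over shared data; a minimal OpenMP sketch in C (illustrative, not taken from the lecture) in which the hardware keeps the shared array consistent across the cores' caches:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    enum { N = 1000000 };
    static double x[N];
    double sum = 0.0;

    /* Threads share x[]; each works on a slice, and the reduction
       clause combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        x[i] = 0.5 * i;
        sum += x[i];
    }

    printf("max threads = %d, sum = %g\n", omp_get_max_threads(), sum);
    return 0;
}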

Page 155: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

SMP - UMA

57

Page 156: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

58

SMP Node Diagram

[Diagram] An SMP node: pairs of microprocessors (MP), each with private L1 and L2 caches and a shared L3, connect through a controller to memory banks M1 … Mn-1, storage (S), and network interface cards (NIC) on Ethernet and PCI-e, plus USB peripherals and a JTAG port.

Legend : MP : MicroProcessor; L1, L2, L3 : Caches; M1.. : Memory Banks; S : Storage; NIC : Network Interface Card

Page 157: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

DSM - NUMA

59

Page 158: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Challenges to Computer Architecture• Expose and exploit extreme fine-grain parallelism

– Possibly multi-billion-way (for Exascale)– Data structure-driven (use meta-data parallelism)

• State storage takes up much more space than logic– 1:1 flops/byte ratio infeasible– Memory access bandwidth is the critical resource

• Latency – can approach a million cycles (10,000 or more cycles, typical)– All actions are local– Contention due to inadequate bandwidth

• Overhead for fine grain parallelism must be very small – or system can not scale– One consequence is that global barrier synchronization is untenable

• Power consumption• Reliability

– Very high replication of elements– Uncertain fault distribution– Fault tolerance essential for good yield

• Design complexity– Impacts development time, testing, power, and reliability

60

Page 159: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

61

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 160: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Multi-Core

• Motivation for Multi-Core– Exploits increased feature-size and density– Increases functional units per chip (spatial efficiency)– Limits energy consumption per operation– Constrains growth in processor complexity

• Challenges resulting from multi-core– Relies on effective exploitation of multiple-thread parallelism

• Need for parallel computing model and parallel programming model– Aggravates memory wall

• Memory bandwidth– Way to get data out of memory banks– Way to get data into multi-core processor array

• Memory latency• Fragments L3 cache

– Pins become strangle point• Rate of pin growth projected to slow and flatten• Rate of bandwidth per pin (pair) projected to grow slowly

– Requires mechanisms for efficient inter-processor coordination• Synchronization• Mutual exclusion• Context switching

62

Page 161: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

IBM Blue Gene/L

63

Page 162: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Intel Core i7

64

Page 163: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

AMD Quad Core Architecture

65

AMD quad-core x86 Opteron processor layout

Page 164: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

66

Page 165: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

IBM/SONY Cell Architecture

• Product of the “STI” alliance: SCEI (Sony), Toshiba and IBM

• Budget estimate ~$400 mil• Primary design center in Austin, TX

(March 2001)• Modified POWER4 toolchain

• The effort took 4 years, with over 400 engineers and 11 IBM centers involved

• Original target applications:

– Sony Playstation 3– IBM blade server– Toshiba HDTV

67

Page 166: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Cell Processor in Numbers

• 234 mil transistors• 221mm2 die on 90nm process• SOI, low-k dielectrics, copper interconnects• 3.2GHz clock speed (over 5Ghz in lab)• Peak performance:

– over 256Gflops @4GHz, single precision– ~26Gflops, double precision– memory bandwidth: 25.6Gbytes/s– I/O bandwidth: 76.8Gbytes/s (48.8 outbound, 32

inbound)• Power consumption undisclosed, estimated at 30W

(MacWorld) or 50-80W (other sources); 5 power states

68

Page 167: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Internal Structure

69

Page 168: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Cell Components and Layout

• One Power Processing Element (PPE)

• Multiple Synergistic Processing Elements (SPE)

• Element Interconnect Bus (EIB)

• Dual channel XDR memory controller

• FlexIO external I/O interface

70

Page 169: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Conventional Strategies to Address the Multi-Core Challenge

• Maintain status quo– Investment in current code stack– Investment in core design

• Increase L2/L3 cache size– Attempt to exploit existing temporal locality

• Increase chip I/O bandwidth– Reduce contention– Eventually embedded optical interfaces chip-to-chip

• Memory bandwidth aggregation through “weaver” chip– Balances processor data demand with memory supply rate– Enables and coordinates multiple overlapping memory banks

• Exploit job stream parallelism– Independent jobs

• O/S scheduling

– Concurrent parametric processes• Multiple instances of same job across parametric set• e.g., Condor

– Coarse grain communicating sequential processes• Message passing; e.g., MPI• Barrier synchronization

71

Page 170: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Limitations of Conventional Incremental Approaches to MultiCore

• It's not just SMP on a chip– Cores on wrong side of the pins– Users expect to see performance gain on existing applications

• Highly sensitive to temporal locality– Fragile in the presence of memory latency– Uses up majority of chip area on caching

• Emphasizes ALU as precious resource– ALU low spatial cost – Memory bandwidth is pacing element for data intensive problems

• Low effective energy usage– Suffers from core complexity

• Does not address intrinsic problems of low efficiency– Just hoping to stay even with Moore’s Law– Single digit sustained/peak performance– Bad when ALU is critical path element

• The Memory Wall is getting Worse!

72

[Chart] CPU clock period (ns) and memory system access time (ns) from 1997 to 2009, with the memory-to-CPU ratio on a second axis: the ratio grows steadily toward several hundred, i.e. the Memory Wall keeps getting worse.

Page 171: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

Commodity Clusters

• Distributed Memory systems

• Superior performance to cost

• Dominant parallel systems architecture on the Top 500 List

• Combines off the shelf systems in scalable structure

• Employs commercial high-bandwidth networks for integration

• Message Passing programming model used (e.g. MPI)

• First cluster on Top500 : Berkeley NOW, 1997

73

Page 172: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

74

Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous

architectures• Summary – Material for the Test

Page 173: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

75

Summary – Material for the Test

• HPC System Stack – slide 5• Performance factors : Technology speed – slide 7• Performance factors : Parallelism –slide 9• Sources of Performance Degradation – slide 10• Computer architecture – slides 12-16• Parallel Structures – slide 18• Performance issues of parallel structures – slide 19• Scalability – slide 21• Performance Metrics – slide 22• Basic uni-processor architecture elements – slide 24• Multiprocessor architecture slides – slides 25 • MPP systems – slides 26,27• Pipeline structures – slides 34,35• Vector processors – slides 36,37• Parallel vector processors (PVP) – slides 40, 41• SIMD – slides 46, 47• Challenges to computer architecture – slides 60

Page 174: Paralel Computing

CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011

76

Page 175: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &

MEANS

COMMODITY CLUSTERS

Prof. Thomas Sterling

Department of Computer Science

Louisiana State University

January 25, 2011

Page 176: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

2

Page 177: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

3

Page 178: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

4

What is a Commodity Cluster

• It is a distributed/parallel computing system

• It is constructed entirely from commodity subsystems

– All subcomponents can be acquired commercially and separately

– Computing elements (nodes) are employed as fully operational

standalone mainstream systems

• Two major subsystems:

– Compute nodes

– System area network (SAN)

• Employs industry standard interfaces for integration

• Uses industry standard software for majority of services

• Incorporates additional middleware for interoperability among

elements

• Uses software for coordinated programming of elements in parallel

Page 179: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

5

Page 180: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

6

Earth Simulator and

TSUBAME

Page 181: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

7

Red Sky

• One of the largest clusters in the

world (located in Sandia National

Laboratories, USA)

• Sun Blade x6275 system family

• 41616 Cores

• Intel EM64T Xeon X55xx (Nehalem-

EP) 2930 MHz (11.72 GFlops)

• 22104 GB main memory

• Number 10 on TOP500

• Infiniband interconnection

• Peak performance:

487 Tflops

• R_max:

423 Tflops

Page 182: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

8

Commodity Clusters vs “Constellations”

[Diagram] A 64-processor Constellation (four 16-way nodes, "16X") and a 64-processor Commodity Cluster (sixteen 4-way nodes, "4X"), each tied together by a System Area Network.

• An ensemble of N nodes each comprising p computing elements

• The p elements are tightly bound shared memory (e.g., smp, dsm)

• The N nodes are loosely coupled, i.e., distributed memory

• p is greater than N

• Distinction is which layer gives us the most power through parallelism
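Stated compactly (a restatement of the bullets above, not additional slide content): with $N$ nodes of $p$ processors each, the machine has $P = N \times p$ processors in total; when $p > N$ most of the parallelism lives inside the shared-memory nodes and the ensemble is called a constellation, and when $N > p$ most of the parallelism is across the network and it is a commodity cluster.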

Page 183: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

9

Columbia

• NASA’s largest computer

• NASA Ames Research Center

• A Constellation

– 20 nodes

– SGI Altix 512 processor nodes

– Total: 10,240 Intel Itanium-2

processors

• 400 Terabytes of RAID

• 2.5 Petabytes of silo farm tape

storage

Page 184: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

10

Page 185: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

11

A Brief History of Clusters

• 1957 – SAGE by IBM & MIT-LL for Air Force NORAD

• 1976 -- Ethernet

• 1984 – Cluster of 160 Apollo workstations by NSA

• 1985 – M31 Andromeda by DEC, 32 VAX 11/750

• 1986 – Production Condor cluster operational

• 1990 – PVM released

• 1993 – First NOW workstation cluster at UC Berkeley

• 1993 – Myrinet introduced

• 1994 – First Beowulf PC cluster at NASA Goddard

• 1994 – MPI standard

• 1996 – >1Gflops

• 1997 – Gordon Bell Prize for Price-Performance

• 1997 – Berkeley NOW first cluster on Top-500

• 1997 -- >10 Gflops

• 1998 – Avalon by LANL on Top500 list

• 1999 -- >100 Gflops

• 2000 – Compaq and PSC awarded 5 Tflops by NSF

Page 186: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

12

UC-Berkeley NOW Project

• NOW-1 1995

• 32-40 SparcStation 10s and

20s

• originally ATM

• first large myrinet network

NOW-2 1997

100+ Ultra Sparc 170s

128 MB, 2 2GB disks, ethernet, myrinet

largest Myrinet configuration in the world

First cluster on the TOP500 list

Page 187: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

13

NOW Accomplishments

• Early prototypes in 1993 & 1994

• First Inktomi

• Complete Glunix + virtual network environment– able to page many processes onto dedicated

user-level network resources

• NPACI production resource since 1998

• Active Messages demonstrates user level communication in full Unix environment

• First cluster on the TOP500 list

• Set all Parallel Disk-disk sort records (2 yrs)– 500 MB/s disk bandwidth

– 1,000 MB/s network bandwidth

• Basis for studies in novel OS structures

[Chart] Minute Sort: gigabytes sorted versus number of processors (0–100), comparing NOW against the SGI Power Challenge and SGI Origin.

Page 188: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

14

NASA Beowulf Project

Wiglaf - 1994

16 Intel 80486 100 MHz

VESA Local bus

256 Mbytes memory

6.4 Gbytes of disk

Dual 10 base-T Ethernet

72 Mflops sustained

$40K

Hrothgar - 1995

16 Intel Pentium100 MHz

PCI

1 Gbyte memory

6.4 Gbytes of disk

100 base-T Fast Ethernet

(hub)

240 Mflops sustained

$46K

Hyglac-1996 (Caltech)

16 Pentium Pro 200 MHz

PCI

2 Gbytes memory

49.6 Gbytes of disk

100 base-T Fast Ethernet

(switch)

1.25 Gflops sustained

$50K

Page 189: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

15

Beowulf Accomplishments

• An experiment in parallel computing systems

• Established the vision of low-cost HPC

• Demonstrated effectiveness of PC clusters for some classes of applications

• Provided networking software in Linux

• Mass Storage with PVFS

• Provided cluster management tools

• Achieved >10 Gflops performance

• Gordon Bell Prize for Price-Performance

• Conveyed findings to broad community

• Tutorials and the book

• Provided design standard to rally community

• Spin-off of Scyld Computing Corp.

Hive at GSFC

Naegling at Caltech CACR

Page 190: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

16

“Do it Yourself Supercomputers”

• Synthesis of just-ready hardware/software elements

• Narrow window of opportunity

• PCs just capable of a few Mflops

• Ethernet LAN (10 base-T) just cheap enough

• A cost constrained requirement with funding

• An open source Unix, albeit immature

• Experience with clustering

• A stable message passing library

• Talent availability to fill the gaps

• Willingness to win or fail

• Modest and well defined goals, vision, and path

Page 191: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

17

Dominance of Clusters in HPC

• Every major HPC vendor (but 1) has a

cluster product

– IBM

– HP

– SUN

– NEC

– Fujitsu

– SGI

– Cray

• Additional vendors dedicated to clusters

– Penguin

– Dell

Page 192: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

18

Page 193: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

19

Clusters Dominate Top-500

Page 194: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

20

Why are Clusters so Prevalent

• Excellent performance to cost for many workloads– Exploits economy of scale

• Mass produced device types

• Mainstream standalone subsystems

– Many competing vendors for similar products

• Just in place configuration– Scalable up and down

– Flexible in configuration

• Rapid tracking of technology advance– First to exploit newest component types

• Programmable– Uses industry standard programming languages and tools

• User empowerment• Low cost, ubiquitous systems

• Programming systems make it relatively easy to program for expert users

Page 195: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

21

1st printing: May, 1999

2nd printing: Aug. 1999

MIT Press

Page 196: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

22

Page 197: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

23

What You Need to Know about Clusters

• Key system elements

– SMP Node

– Interconnect Networks

– Operating Systems

– Resource Management / Scheduling systems

• Programming & Runtime environment

– Message-passing/Cooperative programming model

– Programming languages & compilers, debuggers

• Performance Measurement & Profiling

– How is performance affected

– How to measure how well the applications behave

– How to optimize application behavior

Page 198: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

24

Key Parameters for Cluster Computing

• Peak floating point performance

• Sustained floating point performance

• Main memory capacity

• Bi-section bandwidth

• I/O bandwidth

• Secondary storage capacity

• Organization– Processor architecture

– # processors per node

– # nodes

– Accelerators

– Network topology

• Logistical Issues– Power Consumption

– HVAC / Cooling

– Floor Space (Sq. Ft)

Page 199: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

25

Where’s the Parallelism

• Inter-node

– Multiple nodes

– Primary level for commodity clusters

– Secondary level for constellations

• Multi socket, intra-node

– Routinely 1, 2, 4, 8

– Heterogeneous computing with accelerators

• Multi-core, intra-socket

– 2, 4 cores per socket

• Multi-thread, intra-core

– None or two usually

• ILP, intra-core

– Multiple operations issued per instruction

• Out of order, reservation stations

• Prefetching

• Accelerators

Page 200: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

26

Cluster System

[Diagram] A cluster system: multiple SMP compute nodes (each with microprocessors and their L1/L2/L3 caches, memory banks M1 … Mn-1, a controller, storage, and NICs) joined by an interconnect network, fronted by login & cluster access nodes and a resource management & scheduling subsystem.

Page 201: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

27

Constituent Hardware Elements

• Compute Nodes (“nodes”)

– Standalone mainstream products

– Processors and accelerators

– Memory and caches

– Chip set

– Interfaces

• System Area Network(s)

– Network interface controllers (NIC)

– Switches

– Cables

• External I/O

– File system

– Internet access

– User interface

– Management and administration

Page 202: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

28

Page 203: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

29

Microprocessor Clock Rate

Page 204: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Technology Trends

30

Page 205: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

31

Compute Node Diagram

[Diagram] A compute node: pairs of microprocessors (MP), each with private L1 and L2 caches and a shared L3, connect through a controller to memory banks M0 … Mn-1, storage (S), and network interface cards (NIC) on Ethernet and PCI-e, plus USB peripherals and a JTAG port.

Legend : MP : MicroProcessor; L1, L2, L3 : Caches; M1.. : Memory Banks; S : Storage; NIC : Network Interface Card

Page 206: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Arete Node Picture

32

Page 207: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

33

Parameters for Cluster Nodes

• Processor architecture family (AMD Opteron, Intel Xeon, IBM Power)• Number of processor chips (2)• Number of processor cores per chip (multicore) (3-4)• Memory capacity per processor chip (2 GBytes per core)• Processor core clock rate (3)

– GHz

• Operations per instruction issue, ILP (2 – 4 floating point operations)• Cache size per core (L1, L2, L3)• Distributed or shared memory (SMP) structure

– Cache coherent?

• Number and class of network ports• Latency to main memory (100 – 400 cycles)

– Measured in processor clock cycles

• Disk spindles and capacity (0, 1, or 2)• Ancillary I/O ports• Packaging issues

– Power– Size (1 to 4 u) (http://en.wikipedia.org/wiki/Rack_unit)– Cost

Page 208: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

34

Page 209: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

35

The History of Linux

• Started out with Linus' frustration with the available affordable operating

systems for the PC

• He put together a rudimentary scheduler, and later added on more

features until he could bootstrap the kernel (1991).

• The source was released on the internet in hope that more people

would contribute to the kernel

• GCC was ported, a C library was added and a primitive serial and

tty driver code

• Networks, file systems were added

• Slackware

• RedHat

• Extreme Linux

Page 210: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

36

Open Source Software

• Evolution of PC Clusters has benefited from Open Source Software

• Early examples

– Gnu compiler tools, FreeBSD, Linux, PVM

• Advantages

– Provides shared infrastructure – avoids duplication of effort

– Permits wide collaborations

– Facilitates exploratory studies and innovation

• Free software is not necessarily OSS

• Business model in state of flux: how to fund free deliverables

• Important synergy between OSS standard infrastructure software and

proprietary ISV target-specific software:

– OSS provides common framework

– For-profit software provides incentive and resources

Page 211: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011 37

Linux DistributionsAlphanet Linux

Alzza Linux

Andrew Linux

Apokalypse

Armed Linux

ASPLinux

Bad Penguin

Bastille Linux

Best Linux

BlackCat Linux

Blue Linux

Bluecat Linux

BluePoint Linux

Brutalware

Caldera OpenLinux

Cclinux

ChainSaw Linux

CLEClIeNUX

Conectiva

CoolLinux

Coyote Linux

Corel

COX-Linux

Darkstar Linux

Debian Definite

Linux

deepLINUX

Delix

Dlite (Debian Lite)

DragonLinux

Eagle Linux M68K

easyLinux

Elfstone Linux

Embedix

Enoch

Eonova Linux

ESware

Etlinux

Eurielec Linux

FinnixFloppi Gentoo

Linux

Gentus Linux

Green Frog Linux

Halloween Linux

Hard Hat Linux

HispaFuentes

HVLinux

Icepack

Immunix

OSIndependence

InfoMagick Workgroup

Server

Ivrix

ix86 Linux

JBLinux

Jurix Linux

Kondara

Krud

KW Linux

KSI Linux

L13Plus

Laser5

Leetnux

Lightening

Linpus Linux

Linux Antarctica

Linux by Linux

Linux GT Server Edition

Linux Mandrake

Linux MX

LinuxOne

LinuxPPC

LinuxPPP

LinuxSIS

LinuxWare

Linux-YeS

LNX System

Lunet

LuteLinux

LST

Mastodon

MaxOS™

MIZI Linux OS

MkLinux

MNIS Linux

MicroLinux

Monkey Linux

NeoLinux

Newlix OfficeServer

NoMad Linux

Ocularis

Open Kernel Linux

Open Share Linux

OS2000

Peanut Linux

PhatLINUX

PingOO

Plamo Linux

Platinum Linux

Power Linux

Progeny Debian

Project Freesco

Prosa Debian

Pygmy Linux

Red Flag Linux

Red Hat Linux

Redmond Linux

Rock Linux

RT-Linux

Scrudge Ware

Secure Linux

Skygate Linux

Slacknet Linux

Slackware

Slinux

SOT Linux

Spiro

Stampede Linux

Storm Linux

S.u.SE

Thin Linux

TINY Linux

Trinux

Trustix Secure Linux

TurboLinux

Turquaz

UltraPenguin

Ute-Linux

VA-enhanced RedHat Linux

VectorLinux

Vedova Linux

Vine Linux

White Dwarf Linux

Whole Linux

WinLinux 2000

WorkGroup Solutions

Linux Pro Plus

Xdenu

Xpresso Linux 2000

XTeam Linux

Yellow Dog Linux

Yggdrasil Linux

ZiiF Linux

ZipHam

ZipSlack

Page 212: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

38

Operating System

• What is an Operating System?

– A program that controls the execution of application programs

– An interface between applications and hardware

• Primary functionality

– Exploits the hardware resources of one or more processors

– Provides a set of services to system users

– Manages secondary memory and I/O devices

• Objectives

– Convenience: Makes the computer more convenient to use

– Efficiency: Allows computer system resources to be used in an

efficient manner

– Ability to evolve: Permit effective development, testing, and

introduction of new system functions without interfering with service

Source: William Stallings “Operating Systems: Internals and Design Principles (5th Edition)”

Page 213: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

39

Services Provided by the OS

• Program development

– Editors and debuggers

• Program execution

• Access to I/O devices

• Controlled access to files

• System access

• Protection

• Error detection and response

– Internal and external hardware errors

– Software errors

– Operating system cannot grant request of application

• Accounting

Page 214: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

40

Layers of Computer System

Page 215: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

41

Resources Managed by the OS

• Processor

• Main Memory

– volatile

– referred to as real memory or primary memory

• I/O modules

– secondary memory devices

– communications equipment

– terminals

• System bus

– communication among processors, memory, and I/O modules

Page 216: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

42

Page 217: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

43

Page 218: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

44

Programming on Clusters

• Several ways of programming applications on clusters− Throughput – job stream

− Decoupled Work Queue Model – SPMD for parameter studies

− Communicating Sequential Processes (CSP)

− Multi threaded

• Throughput: job stream– PBS, Maui

• Decoupled Work Queue Model : SPMD, e.g. parametric studies– Condor

• Communicating Sequential Processes– Message passing

– Distributed memory

– Global barrier synchronization

– e.g., MPI

• Multi threaded– Limited to intra-node programming

– Shared memory

– e.g., OpenMP

Page 219: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Throughput Computing

• Simplest form of parallel computing

• Separate jobs on separate compute nodes

– Independent tasks on independent nodes

• No intra application / cross node communication

• “job stream” workflow

• Capacity computing

– Distinguished from cooperative and capability computing

– Scaling dependent on number of concurrent jobs

• Performance

– Throughput

– Total aggregate operations per second achieved

• Widely used for servers

45

Page 220: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Decoupled Work Queue Model

• Concurrent disjoint tasks

• Parametric Studies

– SPMD (single program multiple data)

• Very coarse grained

• Example software package : Condor

• Processor farms and clusters

• Throughput Computing Lecture covers this model of

parallelism in greater depth

46

Page 221: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

47

Page 222: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011 48

Some Node Interconnect Options

• Current Generation

– Gigabit Ethernet (~1000 Mb/s)

– 10 Gigabit Ethernet

– 40 Gigabit Ethernet and 100 Gigabit Ethernet (100GbE)

standards are in draft as of 2009

– Infiniband (IBA)

• Previous Generation

– Fast Ethernet (~100 Mb/s)

– Myricom’s Myrinet-2000 (~1600 Mb/s)

– SCI (~4000 Mb/s)

– OC-12 ATM (~622 Mb/s)

– Fiber Channel (~100 MB/s)

– USB (12 Mb/s)

– Firewire (IEEE 1394 400 Mb/s)

Page 223: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

49

Fast and Gigabit Ethernet

• Cost effective

• Lucent, 3com, Cisco, etc.

• Directly leverage LAN technology and market

• Up to 384 100 Mbps ports in one switch

• Switches can be stacked or connected with multiple gigabit links

• 100 Base-T:– Bandwidth: > 11 MB/s

– Latency: < 90 microseconds

• 1000 Base-T:– Bandwidth: ~ 50 MB/s

– Latency: < 90 microseconds
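A useful first-order model for comparing such interconnects (a standard textbook approximation, not something stated on the slide) is that moving an $m$-byte message costs about

$t(m) \approx \alpha + \dfrac{m}{\beta}$

where $\alpha$ is the per-message latency (here on the order of tens of microseconds) and $\beta$ the sustained bandwidth (here tens of MB/s); short messages are dominated by $\alpha$, bulk transfers by $\beta$.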

Page 224: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

50

Myrinet

• High Performance: 2+2 Gbps

• Low latency: 11 microseconds

• Fiber and copper interconnects

• High Availability – auto reroute

• 4, 8,16 and 64 port switches, stackable

• Scalable to 1000s of hosts

Page 225: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

InfiniBand

51

• High Performance: 10 - 20 Gbps

• Low latency: 1.2 microseconds

• Copper interconnects

• High availability - IEEE 802.3ad Link Aggregation / Channel Bonding

http://www.hpcwire.com/hpc/1342206.html

Page 226: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Network Interconnect Topologies

52

TORUS

FAT-TREE (CLOS)

Page 227: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

53

Dell PowerEdge SC1435

Opteron, IBA

Page 228: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

54

Example: 320-host Clos topology of

16-port switches

64 hosts 64 hosts 64 hosts 64 hosts 64 hosts

(From Myricom)

Page 229: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Arete Infiniband Network

55

Page 230: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

56

Page 231: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

57

Schedulers : PBS

Workload management system – coordinates resource utilization policy and user job requirements– Multi users, Multi jobs, Multi nodes

• Both Open Source and Commercially supported (Veridian)

• Functionality– Manages parallel job execution

– Interactive and batch cross system scheduling

– Security and access control lists

– Dynamic distribution and automatic load-leveling of workload

– Job and user accounting

• Accomplishments– Runs on all Unix and Linux platforms

– Supports MPI

– First release 1995

– 2000 sites registered, 1000 people on the mailing list

– PBSPro sales at >5000 cpu’s

Page 232: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

58

Schedulers : Maui (Moab)

• Cluster Resources Inc.

• Advanced systems software tool for more optimal job scheduling

• Improved administration and statistical reporting capabilities

• Analytical simulation capabilities to evaluate different allocation and prioritization schemes

• Offers different classes of services to users, allowing high-priority users to be scheduled first while preventing long-term starvation of low-priority jobs

• SMP Enabled

Page 233: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

59

Schedulers : Condor

• Distributed Task Scheduler

• Emphasis on throughput or capacity computing

• Services

– Automates cycle harvesting from workstation farms

– Distributed time-sharing and batch processing resource

– Exploits opportunistic versus dedicated resources

– Permits preemptive acquisition of resources

– Transparent checkpointing

– Remote I/O – preserves local execution environment (requires relinking)

– Asynchronous process management, master-worker processing

• Accomplishments

– First production system operational in 1986

– U. of Wisconsin: 1300 Condor-controlled CPUs on campus

– Used by:

• large software house for builds and testing,

• Xerox printer simulation,

• Core Digital Pictures rendering of movies,

• INFN for high energy physics,

• 250 machines at NAS, half million hours

Page 234: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

60

Page 235: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

61

MPI Software

• Community wide standard process

– Leveraged experiences with NX, PVM, P4, Zipcode, others

• Dominant programming model for clusters

• Multiple implementations both OSS and commercial (MPI Soft Tech)

– All of MPI-1

– MPI I/O

– All of MPI-2

– MPI-3 under development

• Functionality

– Message passing model for distributed memory platforms

– Support for truly scalable operations (1000s of nodes)

• Rich set of collective operations (gathers, reduces, scans, all-to-all)

• Scalable one-sided operations (fence barrier synchronization, group-oriented synchronization)

– Dynamic processes (MPI-2): spawn, disconnect, etc., with scalability

• MPICH-2 is an entirely new rewrite

• Open MPI includes fault-tolerance capability

Page 236: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

62

Page 237: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Compilers & Debuggers

• Compilers : – Intel C/ C++ / Fortran

– PGI C/ C++ / Fortran

– GNU C / C++ / Fortran

• Libraries :– Each compiler is linked against MPICH

– Mesh/Grid Partitioning software : METIS etc.

– Math Kernel Libraries (MKL)

– Intel MKL, AMD MKL, GNU Scientific Library (GSL)

– Data format libraries : NetCDF, HDF 5 etc

– Linear Algebra Packages : BLAS, LAPACK etc

• Debuggers– gdb

– Totalview

• Performance & Profiling tools (a usage sketch follows this slide): – PAPI

– TAU

– Gprof

– perfctr

63
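As a hedged sketch of how the listed compilers and profilers fit together, one common workflow is to compile with profiling instrumentation, run the job, and read the resulting flat profile with gprof; the program name and process count below are placeholders.

# Hypothetical example: profile an MPI program with gprof
mpicc -O2 -pg -o myapp myapp.c      # -pg adds gprof instrumentation at compile and link time
mpiexec -n 4 ./myapp                # each rank writes a gmon.out profile (ranks sharing a directory may overwrite it)
gprof ./myapp gmon.out | head -40   # flat profile: time spent per function

PAPI and TAU provide finer-grained hardware-counter and tracing data than this simple flat profile.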

Page 238: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Distributed File Systems

• A distributed file system is a file system that is stored locally on one system (server) but is accessible by processes on many systems (clients).

• Multiple processes access multiple files simultaneously.

• Other attributes of a DFS may include :

– Access control lists (ACLs)

– Client-side file replication

– Server- and client- side caching

• Some examples of DFSes:

– NFS (Sun)

– AFS (CMU)

– PVFS (Clemson, Argonne), OrangeFS

– Lustre (Sun)

– GPFS (IBM)

• Distributed file systems can be used by parallel programs, but they have significant disadvantages :

– The network bandwidth of the server system is a limiting factor on performance

– To retain UNIX-style file consistency, the DFS software must implement some form of locking which has significant performance implications

64

Ohio Supercomputer Center

Page 239: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Distributed File System : NFS

• Popular means for accessing remote file systems in a local area network.

• Based on the client-server model: the remote file systems are "mounted" via NFS and accessed through the Linux virtual file system (VFS) layer (a typical client-side mount is sketched after this slide).

• NFS clients cache file data, periodically checking with the original file for any changes.

• The loosely-synchronous model makes for convenient, low-latency access to shared spaces.

• NFS avoids the common locking systems used to implement POSIX semantics.

65
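For illustration only, making an NFS export visible on a client node typically looks like the following (it requires root privileges); the server name and paths are placeholders, and production clusters normally configure this in /etc/fstab or through an automounter rather than by hand.

# Hypothetical example: mount a home file system exported by an NFS server
mount -t nfs headnode:/export/home /home    # kernel NFS client; the VFS layer presents it like a local file system
df -h /home                                 # confirm the remote file system is mounted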

Page 240: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

66

Parallel Virtual File System (PVFS)

• Clemson University - 1993

• Objective: high throughput file system – DOE, NASA, (GPL)

• Strategy:

– exploit parallelism of bandwidth

– provide user interface so that applications can make powerful requests such as large collection of non-contiguous data with single request for multidimensional data sets,

– allow application direct access to server:

• multiple application tasks directly access/spawn multiple file servers without going through kernel or central mechanism.

• N-clients and N-servers

• Single file spread across multiple disks and nodes and accessed by multiple tasks in an application.

• Scaling facilitated by eliminating single bottleneck

• Actual distribution of a file is configurable on a file by file basis.

• Reactive scheduling addresses the problem of network contention and adapts to file system load

Page 241: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

67

Page 242: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Measuring Performance on Clusters

• Ways of measuring performance (a wall-clock timing sketch follows this slide): – Wall clock time

– Benchmarks

– Processor efficiency factors

– Scalability

– MPI communications and synchronization overhead

– System operations

• Tools– PAPI

– Tau

– Ganglia

– Many others

68
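As a minimal sketch of the first metric above, wall clock time for a code region can be measured directly inside an MPI program with MPI_Wtime; the timed region here is a placeholder.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();                    /* wall-clock timestamp in seconds */
    /* ... region of interest: computation and communication ... */
    t1 = MPI_Wtime();

    printf("rank %d: elapsed %f seconds\n", rank, t1 - t0);

    MPI_Finalize();
    return 0;
}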

Page 243: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

MPI Performance Measurement : VAMPIR

69

src : http://mumps.enseeiht.fr/

Page 244: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

MPI Performance : Tau

70

src : http://www.cs.uoregon.edu/research/tau

Page 245: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

Topics

• Introduction to Commodity Clusters

• A brief history of Cluster computing

• Dominance of Clusters

• Core systems elements of Clusters

• SMP Nodes

• Operating Systems

• DEMO 1 : Arete Cluster Environment

• Throughput Computing

• Networks

• Resource Management / Scheduling Systems

• Message-passing/Cooperative programming model

• Cluster programming/application runtime environment

• Performance measurement & profiling of applications

• Summary Materials for Test

71

Page 246: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

72

Summary – Material for the Test

• What is a commodity cluster – slide 4

• Commodity clusters vs “Constellations” – slide 8

• Key parameters for cluster computing – slide 24

• Where is the parallelism – slide 25

• Parameters for cluster nodes – slide 33

• Node operating system – slide 38,39,40,41

• Programming clusters – slide 44

• Throughput computing – slide 45

• Decoupled work queue model – slide 46

• Interconnect options – slide 48

• Scheduling systems – slide 57, 58, 59

• Message passing : MPI software – slide 61

• Distributed file systems – slide 64

• Measuring performance on cluster: Metrics & Tools – slide 68

Page 247: Paralel Computing

CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011

73

Page 248: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS

CAPACITY COMPUTING

Prof. Thomas Sterling, Department of Computer Science, Louisiana State University, February 1, 2011

Page 249: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

2

Page 250: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

3

Page 251: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Key Terms and Concepts

4

[Figure: a problem represented as a series of instructions executed by a single CPU (serial), versus a problem partitioned into tasks, each a stream of instructions, executed by multiple CPUs (parallel)]

Conventional serial execution (also sequential execution): the problem is represented as a series of instructions that are executed by the CPU.

Parallel execution of a problem involves partitioning of the problem into multiple executable parts that are mutually exclusive and collectively exhaustive, represented as a partially ordered set exhibiting concurrency.

Parallel computing takes advantage of concurrency to:
• Solve larger problems within bounded time
• Save on wall clock time
• Overcome memory constraints
• Utilize non-local resources

Page 252: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Key Terms and Concepts

• Scalable Speedup : Relative reduction of execution time of a fixed size workload through parallel execution

• Scalable Efficiency : Ratio of the actual performance to the best possible performance.

5

$\text{Speedup} = \dfrac{\text{execution time on one processor}}{\text{execution time on } N \text{ processors}}$

$\text{Efficiency} = \dfrac{\text{execution time on one processor}}{\text{execution time on } N \text{ processors} \times N}$
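A small worked example of these two definitions, with made-up timings: if a fixed workload takes 100 s on one processor and 20 s on 8 processors, then

$\text{Speedup} = \dfrac{100\ \text{s}}{20\ \text{s}} = 5, \qquad \text{Efficiency} = \dfrac{100\ \text{s}}{20\ \text{s} \times 8} = 0.625$

so the parallel run uses the 8 processors at 62.5% of their ideal effectiveness.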

Page 253: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

6

Page 254: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Defining the 3 C’s … •  Main Classes of computing :

–  High capacity parallel computing : A strategy for employing distributed computing resources to achieve high throughput processing among decoupled tasks. Aggregate performance of the total system is high if sufficient tasks are available to be carried out concurrently on all separate processing elements. No single task is accelerated. Uses increased workload size of multiple tasks with increased system scale.

–  High capability parallel computing : A strategy for employing tightly coupled structures of computing resources to achieve reduced execution time of a given application through partitioning into concurrently executable tasks. Uses fixed workload size with increased system scale.

–  Cooperative computing : A strategy for employing moderately coupled ensemble of computing resources to increase size of the data set of a user application while limiting its execution time. Uses a workload of a single task of increased data set size with increased system scale.

7

Page 255: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Strong Scaling Vs. Weak Scaling

8  

[Figure: work per task vs. machine scale (# of nodes: 1, 2, 4, 8), contrasting weak scaling and strong scaling]

Page 256: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Strong Scaling, Weak Scaling

9  

[Figure: total problem size and granularity (size / node) vs. machine scale (# of nodes), comparing strong scaling and weak scaling]

Page 257: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Defining the 3 C’s … •  High capacity computing systems emphasize the

overall work performed over a fixed time period. Work is defined as the aggregate amount of computation performed across all functional units, all threads, all cores, all chips, all coprocessors and network interface cards in the system.

•  High capability computing systems emphasize improvement (reduction) in execution time of a single user application program of fixed data set size.

•  Cooperative computing systems emphasize single application weak scaling –  Performance increase through increase in problem size

(usually data set size and # of task partitions) with increase in system scale

10

Adapted from : High-performance throughput computing S Chaudhry, P Caprioli, S Yip, M Tremblay - IEEE Micro, 2005 - doi.ieeecomputersociety.org

Page 258: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Strong Scaling, Weak Scaling

11

[Figure: classification of capability, cooperative, and capacity computing along the strong scaling / weak scaling and single job / workload size scaling axes]

• Capability
– Primary scaling is decrease in response time proportional to increase in resources applied
– Single job, constant size – goal: response-time scaling proportional to machine size
– Tightly-coupled concurrent tasks making up a single job

• Cooperative
– Single job (different nodes working on different partitions of the same job)
– Job size scales proportional to machine
– Granularity per node is fixed over range of system scale
– Loosely coupled concurrent tasks making up a single job

• Capacity
– Primary scaling is increase in throughput proportional to increase in resources applied
– Decoupled concurrent tasks, each a separate job, increasing in number of instances, scaling proportional to machine

Page 259: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

12

Page 260: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Models of Parallel Processing •  Conventional models of parallel processing

–  Decoupled Work Queue (covered in segment 1 of the course) –  Communicating Sequential Processing (CSP message passing)

(covered in segment 2) –  Shared memory multiple thread (covered in segment 3)

•  Some alternative models of parallel processing –  SIMD

•  Single instruction stream multiple data stream processor array –  Vector Machines

•  Hardware execution of value sequences to exploit pipelining –  Systolic

•  An interconnection of basic arithmetic units to match algorithm –  Data Flow

•  Data precedent constraint self-synchronizing fine grain execution units supporting functional (single assignment) execution

13

Page 261: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Shared memory multiple Thread

•  Static or dynamic •  Fine Grained •  OpenMP •  Distributed shared memory systems •  Covered in Segment 3

14

[Figure: Symmetric Multi Processor (SMP, usually cache coherent) and Distributed Shared Memory (DSM, usually cache coherent): CPUs and memories connected by a network; photo: Orion, JPL NASA]

Page 262: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Communicating Sequential Processes

•  One process is assigned to each processor

•  Work done by the processor is performed on the local data

•  Data values are exchanged by messages

•  Synchronization constructs for inter process coordination

•  Distributed Memory •  Coarse Grained •  MPI application programming interface •  Commodity clusters and MPP

–  MPP is acronym for “Massively Parallel Processor”

•  Covered in Segment 2

15

[Figure: Distributed Memory (DM, often not cache coherent): CPUs, each with private memory, connected by a network; photo: QueenBee]

Page 263: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Decoupled Work Queue Model

•  Concurrent disjoint tasks –  Job stream parallelism –  Parametric Studies

•  SPMD (single program multiple data)

•  Very coarse grained •  Example software package : Condor •  Processor farms and commodity clusters •  This lecture covers this model of parallelism

16

Page 264: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

17

Page 265: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Ideal Speedup Issues

18

•  W is total workload measured in elemental pieces of work (e.g. operations, instructions, subtasks, tasks, etc.)

•  T(p) is total execution time measured in elemental time steps (e.g. clock cycles) where p is # of execution sites (e.g. processors, threads)

• wi is work for a given task i, measured in operations

• Example: here we divide a million-operation (really mega, 2^20) workload, W, into a thousand (really 1024) tasks, w1 to w1024, each of 1 K operations

• Assume 256 processors performing workload in parallel

• T(256) = 4096 steps, speedup = 256, Eff = 1

Page 266: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Ideal Speedup Example

19

[Figure: workload $W = 2^{20}$ operations split into $2^{10}$ tasks $w_1 \dots w_{2^{10}}$ of $2^{10}$ operations each, executed on $P = 2^8 = 256$ processors; units are steps]

$W = \sum_i w_i$, \quad $T(1) = 2^{20}$, \quad $T(2^8) = 2^{12}$

$\text{Speedup} = \dfrac{2^{20}}{2^{12}} = 2^8$

$\text{Efficiency} = \dfrac{2^{20}}{2^{12} \times 2^8} = 2^0 = 1$

Page 267: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Granularities in Parallelism Overhead

•  The additional work that needs to be performed in order to manage the parallel resources and concurrent abstract tasks that is in the critical time path.

Coarse Grained •  Decompose problem into large independent

tasks. Usually there is no communication between the tasks. Also defined as a class of parallelism where: “relatively large amounts of computational work is done between communication”

Fine Grained •  Decompose problem into smaller inter-

dependent tasks. Usually these tasks are communication intensive. Also defined as a class of parallelism where: “relatively small amounts of computational work are done between communication events” –www.llnl.gov/computing/tutorials/parallel_comp

20

Images adapted from : http://www.mhpcc.edu/training/workshop/parallel_intro/

[Figure: proportion of overhead vs. computation per task for coarse-grained and finely-grained decompositions]

Page 268: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Overhead

21

•  Overhead: Additional critical path (in time) work required to manage parallel resources and concurrent tasks that would not be necessary for purely sequential execution

•  V is total overhead of workload execution •  vi is overhead for individual task wi

•  Each task takes vi +wi time steps to complete •  Overhead imposes upper bound on scalability

Page 269: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Overhead

22

[Figure: P tasks, each an overhead segment v followed by a work segment w; for four tasks, V + W = 4v + 4w]

Assumption: workload is infinitely divisible.

$W = \sum_{i=1}^{P} w_i$, \quad $w_i = \dfrac{W}{P}$, \quad $T_P = v + \dfrac{W}{P}$

$S = \dfrac{T_1}{T_P} = \dfrac{W + v}{\frac{W}{P} + v} \approx \dfrac{W}{\frac{W}{P} + v} = \dfrac{P}{1 + \frac{Pv}{W}} = \dfrac{P}{1 + \frac{v}{W/P}}$

v = overhead, V = total overhead, w = work unit, W = total work, $T_i$ = execution time with i processors, P = # processors

Page 270: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Scalability and Overhead for fixed sized work tasks

23

• W is divided into J tasks of size wg

• Each task requires v overhead work to manage

• For P processors there are approximately J/P tasks to be performed in sequence, so

• TP is J(wg + v)/P

• Note that S = T1 / TP

• So, S = P / (1 + v / wg) (a worked set of numbers follows this slide)

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Scalability & Overhead

24

J = # tasks = Wwg

!

"""

#

$$$%Wwg

T1 =W + v %W

TP =JP& wg + v( ) = W

Pwg

& (wg + v) =WP1+ v

wg

'

())

*

+,,

TP =WP1+ v

wg

'

())

*

+,,

S = T1TP

-W

WP1+ v

wg

'

())

*

+,,

=P

1+ vwg

when W >> v

v  =  overhead  wg  =  work  unit  W  =  Total  work  Ti  =  execu1on  1me  with  i  processors  P  =  #  Processors  J  =  #  Tasks  

Page 272: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

25

Page 273: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Capacity Computing with basic Unix tools

•  Combination of common Unix utilities such as ssh, scp, rsh, rcp can be used to remotely create jobs (to get more information about these commands try man ssh, man scp, man rsh, man rcp on any Unix shell)

• For small workloads it can be convenient to wrap the execution of the program in a simple shell script (a minimal sketch follows this slide).

• Relying on simple Unix utilities poses several application management constraints for cases such as:
– Aborting started jobs
– Querying for free machines
– Querying for job status
– Retrieving job results
– etc.

26
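A minimal sketch of this approach, assuming password-less ssh to the listed machines and a shared file system; the host names, application, and parameters are placeholders.

#!/bin/bash
# Hypothetical decoupled work queue built from plain ssh: one independent task per host
HOSTS="node01 node02 node03 node04"          # placeholder machine names
PARAM=1
for h in $HOSTS; do
    # launch one task per host in the background; each task writes its own output file
    ssh "$h" "cd $PWD && ./my_task $PARAM > result.$PARAM" &
    PARAM=$((PARAM + 1))
done
wait                                          # block until every remote task has finished
echo "all tasks completed"

Even this tiny script illustrates the constraints listed above: it offers no help with aborting started jobs, querying machine or job status, or collecting results beyond the shared file system.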

Page 274: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

BOINC, SETI@Home

• BOINC (Berkeley Open Infrastructure for Network Computing)
• Open-source software that enables distributed coarse-grained computations over the internet.
• Follows the Master-Worker model; in BOINC no communication takes place among the worker nodes
• SETI@Home
• Einstein@Home
• Climate prediction
• And many more…

27

Page 275: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

28

Page 276: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Management Middleware : Condor

• Designed, developed and maintained at the University of Wisconsin-Madison by a team led by Miron Livny

• Condor is a versatile workload management system for managing a pool of distributed computing resources to provide high capacity computing.

•  Assists distributed job management by providing mechanisms for job queuing, scheduling, priority management, tools that facilitate utilization of resources across Condor pools

•  Condor also enables resource management by providing monitoring utilities, authentication & authorization mechanisms, condor pool management utilities and support for Grid Computing middleware such as Globus.

•  Condor Components •  ClassAds •  Matchmaker •  Problem Solvers

29

Page 277: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Condor Components : Class Ads •  ClassAds (Classified Advertisements) concept is

very similar to the newspaper classifieds concepts where buyers and sellers advertise their products using abstract yet uniquely defining named expressions. Example : Used Car Sales

•  ClassAds language in Condor provides well defined means of describing the User Job and the end resources ( storage / computational ) so that the Condor MatchMaker can match the job with the appropriate pool of resources.

Management Middleware : Condor

Src : Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience" Concurrency and Computation: Practice and

Experience, Vol. 17, No. 2-4, pages 323-356, February-April, 2005. http://www.cs.wisc.edu/condor/doc/condor-practice.pdf

30

Page 278: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Job ClassAd & Machine ClassAd

31  

Page 279: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Condor MatchMaker •  MatchMaker, a crucial part of the Condor

architecture, uses the job description classAd provided by the user and matches the Job to the best resource based on the Machine description classAd

•  MatchMaking in Condor is performed in 4 steps : 1.  Job Agent (A) and resources (R) advertise themselves. 2.  Matchmaker (M) processes the known classAds and

generates pairs that best match resources and jobs 3.  Matchmaker informs each party of the job-resource pair of

their prospective match. 4.  The Job agent and resource establish connection for further

processing. (Matchmaker plays no role in this step, thus ensuring separation between selection of resources and subsequent activities)

Management Middleware : Condor

Src : Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience" Concurrency and

Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April, 2005.

http://www.cs.wisc.edu/condor/doc/condor-practice.pdf

32

Page 280: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

33

Page 281: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Condor Problem Solvers •  Master-Worker (MW) is a problem solving system that is

useful for solving a coarse grained problem of indeterminate size such as parameter sweep etc.

•  The MW Solver in Condor consists of 3 main components : work-list, a tracking module, and a steering module. The work-list keeps track of all pending work that master needs done. The tracking module monitors progress of work currently in progress on the worker nodes. The steering module directs computation based on results gathered and the pending work-list and communicates with the matchmaker to obtain additional worker processes.

•  DAGMan is used to execute multiple jobs that have dependencies represented as a Directed Acyclic Graph where the nodes correspond to the jobs and edges correspond to the dependencies between the jobs. DAGMan provides various functionalities for job monitoring and fault tolerance via creation of rescue DAGs.

Management Middleware : Condor

34

[Figure: a master process coordinating worker processes w1 … wN]

Page 282: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Management Middleware : Condor

In-depth coverage: http://www.cs.wisc.edu/condor/publications.html

Recommended reading: Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience", Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April 2005. [PDF]

Todd Tannenbaum, Derek Wright, Karen Miller, and Miron Livny, "Condor - A Distributed Job Scheduler", in Thomas Sterling, editor, Beowulf Cluster Computing with Linux, The MIT Press, 2002. ISBN: 0-262-69274-0 [Postscript] [PDF]

35

Page 283: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Core components of Condor •  condor_master: This program runs constantly and ensures that all other parts of Condor

are running. If they hang or crash, it restarts them. •  condor_collector: This program is part of the Condor central manager. It collects

information about all computers in the pool as well as which users want to run jobs. It is what normally responds to the condor_status command. It's not running on your computer, but on the main Condor pool host (Arete head node).

•  condor_negotiator: This program is part of the Condor central manager. It decides what jobs should be run where. It's not running on your computer, but on the main Condor pool host (Arete head node).

•  condor_startd: If this program is running, it allows jobs to be started up on this computer--that is, Arete is an "execute machine". This advertises Arete to the central manager (more on that later) so that it knows about this computer. It will start up the jobs that run.

•  condor_schedd If this program is running, it allows jobs to be submitted from this computer--that is, desktron is a "submit machine". This will advertise jobs to the central manager so that it knows about them. It will contact a condor_startd on other execute machines for each job that needs to be started.

•  condor_shadow For each job that has been submitted from this computer (e.g., desktron), there is one condor_shadow running. It will watch over the job as it runs remotely. In some cases it will provide some assistance. You may or may not see any condor_shadow processes running, depending on what is happening on the computer when you try it out.

36

Source : http://www.cs.wisc.edu/condor/tutorials/cw2005-condor/intro.html

Page 284: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Condor : A Walkthrough of Condor commands

condor_status : provides current pool status
condor_q : provides current job queue
condor_submit : submit a job to condor pool
condor_rm : delete a job from job queue

37

Page 285: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

What machines are available ? (condor_status)

condor_status queries resource information sources and provides the current status of the condor pool of resources

38

§ Some common condor_status command line options:
§ -help : displays usage information
§ -avail : queries condor_startd ads and prints information about available resources
§ -claimed : queries condor_startd ads and prints information about claimed resources
§ -ckptsrvr : queries condor_ckpt_server ads and displays checkpoint server attributes
§ -pool hostname : queries the specified central manager (by default queries $COLLECTOR_HOST)
§ -verbose : displays entire classads
§ For more options and what they do run "condor_status -help"

Page 286: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

condor_status : Resource States

•  Owner : The machine is currently being utilized by a user. The machine is currently unavailable for jobs submitted by condor until the current user job completes.

•  Claimed : Condor has selected the machine for use by other users.

•  Unclaimed : Machine is unused and is available for selection by condor.

•  Matched : Machine is in a transition state between unclaimed and claimed

•  Preempting : Machine is currently vacating the resource to make it available to condor.

39

Page 287: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Example : condor_status

40

[cdekate@celeritas ~]$ condor_status
Name           OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime
vm1@compute-0  LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:42:23
vm2@compute-0  LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:42:24
vm3@compute-0  LINUX  X86_64  Unclaimed  Idle      0.010   1964  0+00:45:06
vm4@compute-0  LINUX  X86_64  Owner      Idle      1.000   1964  0+00:00:07
vm1@compute-0  LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:42:25
vm2@compute-0  LINUX  X86_64  Unclaimed  Idle      0.000   1964  1+09:05:58
vm3@compute-0  LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:37:27
vm4@compute-0  LINUX  X86_64  Unclaimed  Idle      0.000   1964  0+00:05:07
…
vm3@compute-0  LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:42:33
vm4@compute-0  LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:42:34

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
X86_64/LINUX     32      3        0         29        0           0         0
       Total     32      3        0         29        0           0         0

Page 288: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

What jobs are currently in the queue? condor_q

•  condor_q provides a list of job that have been submitted to the Condor pool

•  Provides details about jobs including which cluster the job is running on, owner of the job, memory consumption, the name of the executable being processed, current state of the job, when the job was submitted and how long has the job been running.

41

§ Some common condor_q command line options:
§ -global : queries all job queues in the pool
§ -name : queries based on the schedd name; provides a queue listing of the named schedd
§ -claimed : queries condor_startd ads and prints information about claimed resources
§ -goodput : displays job goodput statistics ("goodput is the allocation time when an application uses a remote workstation to make forward progress." – Condor Manual)
§ -cputime : displays the remote CPU time accumulated by the job to date...
§ For more options run: "condor_q -help"

Page 289: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

[cdekate@celeritas ~]$ condor_q
-- Submitter: celeritas.cct.lsu.edu : <130.39.128.68:40472> : celeritas.cct.lsu.edu
 ID    OWNER    SUBMITTED     RUN_TIME   ST PRI SIZE CMD
30.0   cdekate  1/23 07:52    0+00:01:13 R  0   9.8  fib 100
30.1   cdekate  1/23 07:52    0+00:01:09 R  0   9.8  fib 100
30.2   cdekate  1/23 07:52    0+00:01:07 R  0   9.8  fib 100
30.3   cdekate  1/23 07:52    0+00:01:11 R  0   9.8  fib 100
30.4   cdekate  1/23 07:52    0+00:01:05 R  0   9.8  fib 100

5 jobs; 0 idle, 5 running, 0 held
[cdekate@celeritas ~]$

42

Example  :  condor_q  

Page 290: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

How to submit your Job ? condor_submit

•  Create a job classAd (condor submit file) that contains Condor keywords and user configured values for the keywords.

• Submit the job classAd using "condor_submit"
• Example: condor_submit matrix.submit
• condor_submit -h provides additional flags

43

[cdekate@celeritas NPB3.2-MPI]$ condor_submit -h
Usage: condor_submit [options] [cmdfile]
  Valid options:
  -verbose              verbose output
  -name <name>          submit to the specified schedd
  -remote <name>        submit to the specified remote schedd (implies -spool)
  -append <line>        add line to submit file before processing (overrides submit file; multiple -a lines ok)
  -disable              disable file permission checks
  -spool                spool all files to the schedd
  -password <password>  specify password to MyProxy server
  -pool <host>          Use host as the central manager to query
  If [cmdfile] is omitted, input is read from stdin

Page 291: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

condor_submit : Example

44

[cdekate@celeritas ~]$ condor_submit fib.submit
Submitting job(s).....
Logging submit event(s).....
5 job(s) submitted to cluster 35.
[cdekate@celeritas ~]$ condor_q
-- Submitter: celeritas.cct.lsu.edu : <130.39.128.68:51675> : celeritas.cct.lsu.edu
 ID    OWNER    SUBMITTED     RUN_TIME   ST PRI SIZE CMD
35.0   cdekate  1/24 15:06    0+00:00:00 I  0   9.8  fib 10
35.1   cdekate  1/24 15:06    0+00:00:00 I  0   9.8  fib 15
35.2   cdekate  1/24 15:06    0+00:00:00 I  0   9.8  fib 20
35.3   cdekate  1/24 15:06    0+00:00:00 I  0   9.8  fib 25
35.4   cdekate  1/24 15:06    0+00:00:00 I  0   9.8  fib 30

5 jobs; 5 idle, 0 running, 0 held
[cdekate@celeritas ~]$

Page 292: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

How to delete a submitted job ? condor_rm

•  condor_rm : Deletes one or more jobs from Condor job pool. If a particular Condor pool is specified as one of the arguments then the condor_schedd matching the specification is contacted for job deletion, else the local condor_schedd is contacted.

45

[cdekate@celeritas ~]$ condor_rm -h
Usage: condor_rm [options] [constraints]
 where [options] is zero or more of:
  -help               Display this message and exit
  -version            Display version information and exit
  -name schedd_name   Connect to the given schedd
  -pool hostname      Use the given central manager to find daemons
  -addr <ip:port>     Connect directly to the given "sinful string"
  -reason reason      Use the given RemoveReason
  -forcex             Force the immediate local removal of jobs in the X state (only affects jobs already being removed)
 and where [constraints] is one or more of:
  cluster.proc        Remove the given job
  cluster             Remove the given cluster of jobs
  user                Remove all jobs owned by user
  -constraint expr    Remove all jobs matching the boolean expression
  -all                Remove all jobs (cannot be used with other constraints)
[cdekate@celeritas ~]$

Page 293: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

[cdekate@celeritas ~]$ condor_q
-- Submitter: celeritas.cct.lsu.edu : <130.39.128.68:51675> : celeritas.cct.lsu.edu
 ID    OWNER    SUBMITTED     RUN_TIME   ST PRI SIZE CMD
41.0   cdekate  1/24 15:43    0+00:00:03 R  0   9.8  fib 100
41.1   cdekate  1/24 15:43    0+00:00:01 R  0   9.8  fib 150
41.2   cdekate  1/24 15:43    0+00:00:00 R  0   9.8  fib 200
41.3   cdekate  1/24 15:43    0+00:00:00 R  0   9.8  fib 250
41.4   cdekate  1/24 15:43    0+00:00:00 R  0   9.8  fib 300

5 jobs; 0 idle, 5 running, 0 held
[cdekate@celeritas ~]$ condor_rm 41.4
Job 41.4 marked for removal
[cdekate@celeritas ~]$ condor_rm 41
Cluster 41 has been marked for removal.
[cdekate@celeritas ~]$

46

condor_rm : Example

Page 294: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Creating Condor submit file ( Job a ClassAd )

•  Condor submit file contains key-value pairs that help describe the application to condor.

• Condor submit files are job ClassAds.

• Some of the common descriptions found in the job ClassAds are listed below (a filled-in example follows this slide):

47

executable = (path to the executable to run on Condor)
input = (standard input provided as a file)
output = (standard output stored in a file)
log = (output to log file)
arguments = (arguments to be supplied to the queue)
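Filling those keywords in, a small hypothetical submit description might look like the following; the executable, argument, and file names are examples only, and $(Process) expands to 0..4 so the queued instances do not overwrite each other's output.

# fib.submit : hypothetical Condor submit description
universe    = vanilla
executable  = fib
arguments   = 100
input       = /dev/null
output      = fib.$(Process).out
log         = fib.log
queue 5

Submitting it with "condor_submit fib.submit" queues five independent instances of the job, as in the condor_submit example shown earlier.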

Page 295: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

DEMO : Steps involved in running a job on Condor.

1.  Creating a Condor submit file 2.  Submitting the Condor submit file to a Condor pool 3.  Checking the current state of a submitted job 4.  Job status Notification

48

Page 296: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Condor Usage Statistics

49

Page 297: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Montage workload implemented and executed using Condor ( Source : Dr. Dan Katz )

• Mosaicking astronomical images:
• Powerful telescopes taking high resolution (and highest zoom) pictures of the sky can cover a small region over time
• The problem being solved in this project is "stitching" these images together to make a high-resolution, zoomed-in snapshot of the sky
• Aggregate requirements of 140000 CPU hours (~16 years on a single machine), with output on the order of 6 TeraBytes

50

[Figure: example DAG for 10 input files; Montage compute nodes (mProject, mDiff, mFitPlane, mConcatFit, mBgModel, mBackground, mAdd) together with data stage-in, data stage-out, and registration nodes. Pegasus maps the abstract workflow to an executable form, using Grid Information Systems (information about available resources, data location) and MyProxy (user's grid credentials); Condor DAGMan executes the workflow on the Grid. http://pegasus.isi.edu/]

Page 298: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Montage Use By IPHAS: The INT/WFC Photometric H-alpha Survey of the Northern Galactic Plane

(Source : Dr. Dan Katz)

[Images: Supernova remnant S147; nebulosity in vicinity of HII region IC 1396B, in Cepheus; Crescent Nebula NGC 6888]

Study extreme phases of stellar evolution that involve very large mass loss

51  

Page 299: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

52

Page 300: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

53

• Throughput computing

• Performance measured as total workload performed over time to complete

• Overhead factors
– Start up time
– Input data distribution
– Output result data collection
– Terminate time
– Inter-task coordination overhead (no task coupling)

• Starvation
– Insufficient work to keep all processors busy
– Inadequate parallelism of coarse grained task parallelism
– Poor or uneven load distribution

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Topics

•  Key terms and concepts •  Basic definitions •  Models of parallelism •  Speedup and Overhead •  Capacity Computing & Unix utilities •  Condor : Overview •  Condor : Useful commands •  Performance Issues in Capacity Computing •  Material for Test

54

Page 302: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

Summary : Material for the Test •  Key terms & Concepts (4,5,7,8,9,10,11) •  Decoupled work-queue model (16) •  Ideal speedup (18,19) •  Overhead and Scalability (20,21,22,23,24) •  Understand Condor concepts detailed in slides (30,

31,32, 34,35, 36,37) •  Capacity computing performance issues (53) •  Required reading materials :

–  http://www.cct.lsu.edu/~cdekate/7600/beowulf-chapter-rev1.pdf –  Specific pages to focus on : 3-16

55

Page 303: Paralel Computing

CSC  7600  Lecture  5  :  Capacity  Compu1ng,    Spring  2011  

56  

Page 304: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &

MEANS

MESSAGE PASSING INTERFACE MPI

(PART A)

Prof. Thomas Sterling, Department of Computer Science, Louisiana State University, February 8, 2011

Page 305: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

2

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 306: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

3

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 307: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Opening Remarks

• Context: distributed memory parallel computers

• We have communicating sequential processes, each with their own memory, and no access to another process's memory

– A fairly common scenario from the mid 1980s (Intel Hypercube) to today

– Processes interact (exchange data, synchronize) through message passing

– Initially, each computer vendor had its own library and calls

– First standardization was PVM

• Started in 1989, first public release in 1991

• Worked well on distributed machines

• Next was MPI

4

Page 308: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

What you'll Need to Know

• What is a standard API

• How to build and run an MPI-1 program

• Basic MPI functions

– 4 basic environment functions

• Including the idea of communicators

– Basic point-to-point functions

• Blocking and non-blocking

• Deadlock and how to avoid it

• Data types

– Basic collective functions

• The advanced MPI-1 material may be required for the

problem set

• The MPI-2 highlights are just for information

5

Page 309: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

6

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 310: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI Standard

• From 1992-1994, a community representing both vendors and users decided to create a standard interface to message passing calls in the context of distributed memory parallel computers (MPPs, there weren't really clusters yet)

• MPI-1 was the result

– "Just" an API

– FORTRAN77 and C bindings

– Reference implementation (mpich) also developed

– Vendors also kept their own internals (behind the API)

7

Page 311: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI Standard

• Since then– MPI-1.1

• Fixed bugs, clarified issues

– MPI-2

• Included MPI-1.2

– Fixed more bugs, clarified more issues

• Extended MPI

– New datatype constructors, language interoperability

• New functionality

– One-sided communication

– MPI I/O

– Dynamic processes

• FORTRAN90 and C++ bindings

• Best MPI reference– MPI Standard - on-line at: http://www.mpi-forum.org/

8

Page 312: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

9

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 313: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI : Basics

• Every MPI program must contain the preprocessor directive

• The mpi.h file contains the definitions and declarations necessary for

compiling an MPI program.

• mpi.h is usually found in the “include” directory of most MPI

installations. For example on arete:

10

#include "mpi.h"

...

#include "mpi.h"
...
MPI_Init(&argc, &argv);

...

...

MPI_Finalize();

...

Page 314: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

11

MPI: Initializing MPI Environment

Function: MPI_init()

int MPI_Init(int *argc, char ***argv)

Description:Initializes the MPI execution environment. MPI_init() must be called before any other MPI functions can be called and it should be called only once. It allows systems to do any special setup so that MPI Library can be used. argc is a pointer to the number of arguments and argv is a pointer to the argument vector. On exit from this routine, all processes will have a copy of the argument list.

...

#include "mpi.h"

...

MPI_Init(&argc,&argv);...

...

MPI_Finalize();

...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Init.html

Page 315: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

12

MPI: Terminating MPI Environment

Function: MPI_Finalize()

int MPI_Finalize()

Description: Terminates the MPI execution environment. All MPI processes must call this routine before exiting. MPI_Finalize() need not be the last executable statement or even in main; it must be called at some point following the last call to any other MPI function.

...

#include "mpi.h"

...

MPI_Init(&argc,&argv);

...

...

MPI_Finalize();...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Finalize.html

Page 316: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI Hello World

• C source file for a simple MPI Hello World

13

#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    MPI_Init( &argc, &argv );
    printf( "Hello, World!\n" );
    MPI_Finalize();
    return 0;
}

Include header files

Initialize MPI Context

Finalize MPI Context

Page 317: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Building an MPI Executable

• Library version

– User knows where header file and library are, and tells compiler

gcc -Iheaderdir -Llibdir mpicode.c -lmpich

• Wrapper version

– Does the same thing, but hides the details from the user

mpicc -o executable mpicode.c

You can do either one, but don't try to do both!

– use "sh -x mpicc -o executable mpicode.c" to figure out the gcc line

For our “Hello World” example on arete use:

mpicc -o hello hello.c

14

gcc -m64 -O2 -fPIC -Wl,-z,noexecstack -o hello hello.c -I/usr/include/mpich2-x86_64 -L/usr/lib64/mpich2/lib -L/usr/lib64/mpich2/lib -Wl,-rpath,/usr/lib64/mpich2/lib -lmpich -lopa -lpthread -lrt

OR

Page 318: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Running an MPI Executable

• Some number of processes are started somewhere

– Again, the standard doesn't talk about this

– Implementation and interface varies

– Usually, some sort of mpiexec command starts some number of copies

of an executable according to a mapping

– Example: 'mpiexec -n 2 ./a.out' runs two copies of ./a.out, with the number of processes specified as 2

– Most production supercomputing resources wrap the mpiexec command with

higher level scripts that interact with scheduling systems such as PBS /

LoadLeveler for efficient resource management and multi-user support

– Sample PBS / LoadLeveler job submission scripts:

PBS File:
#!/bin/bash
#PBS -l walltime=120:00:00,nodes=8:ppn=4
cd /home/cdekate/S1_L2_Demos/adc/
pwd
date
PROCS=`wc -l < $PBS_NODEFILE`
mpdboot --file=$PBS_NODEFILE
mpiexec -n $PROCS ./padcirc
mpdallexit
date

LoadLeveler File:
#!/bin/bash
#@ job_type = parallel
#@ job_name = SIMID
#@ wall_clock_limit = 120:00:00
#@ node = 8
#@ total_tasks = 32
#@ initialdir = /scratch/cdekate/
#@ executable = /usr/bin/poe
#@ arguments = /scratch/cdekate/padcirc
#@ queue

15

Page 319: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Running the Hello World example

• Using mpiexec : • Using PBS

16

mpd &
mpiexec -n 8 ./hello
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!

hello.pbs :
#!/bin/bash
#PBS -N hello
#PBS -l walltime=00:01:00,nodes=2:ppn=4
cd /home/cdekate/2008/l7
pwd
date
PROCS=`wc -l < $PBS_NODEFILE`
mpdboot -f $PBS_NODEFILE
mpiexec -n $PROCS ./hello
mpdallexit
date

more hello.o10030
/home/cdekate/2008/l7
Wed Feb 6 10:58:36 CST 2008
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Wed Feb 6 10:58:37 CST 2008

Page 320: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

17

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 321: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI Communicators

• Communicator is an internal object

• MPI Programs are made up of communicating processes

• Each process has its own address space containing its own attributes such as rank, size (and argc, argv, etc.)

• MPI provides functions to interact with it

• Default communicator is MPI_COMM_WORLD

– All processes are its members

– It has a size (the number of processes)

– Each process has a rank within it

– One can think of it as an ordered list of processes

• Additional communicator(s) can co-exist

• A process can belong to more than one communicator

• Within a communicator, each process has a unique rank

[Figure: the MPI_COMM_WORLD communicator containing eight processes with ranks 0 through 7]

18

Page 322: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

19

MPI: Size of Communicator

Function: MPI_Comm_size()

int MPI_Comm_size ( MPI_Comm comm, int *size )

Description:Determines the size of the group associated with a communicator (comm). Returns an integer number of processes in the group underlying comm executing the program. If comm is an inter-communicator (i.e. an object that has processes of two inter-communicating groups) , return the size of the local group (a size of a group where request is initiated from). The comm in the argument list refers to the communicator-group to be queried, the result of the query (size of the comm group) is stored in the variable size.

...

#include "mpi.h"

...

int size;
MPI_Init(&argc, &argv);

...

MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

...

err = MPI_Finalize();

...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Comm_size.html

Page 323: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

20

MPI: Rank of a process in comm

Function: MPI_Comm_rank()

int MPI_Comm_rank ( MPI_Comm comm, int *rank )

Description: Returns the rank of the calling process in the group underlying comm. If comm is an inter-communicator, MPI_Comm_rank returns the rank of the process in the local group. The first parameter, comm, is the communicator to be queried; the second parameter, rank, receives the rank of the calling process in the group of comm.

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Comm_rank.html

...

#include "mpi.h"

...

int size, rank;
MPI_Init(&argc, &argv);

...

MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

...

err = MPI_Finalize();

...

Page 324: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Example : communicators

21

#include "mpi.h"#include <stdio.h>

int main( int argc, char *argv[]){

int rank, size;MPI_Init( &argc, &argv);MPI_Comm_rank( MPI_COMM_WORLD, &rank);MPI_Comm_size( MPI_COMM_WORLD, &size);printf("Hello, World! from %d of %d\n", rank, size );MPI_Finalize();return 0;

}

Determines the rank of the current process in the communicator-group

MPI_COMM_WORLD

Determines the size of the communicator-group MPI_COMM_WORLD

…
Hello, World! from 1 of 8
Hello, World! from 0 of 8
Hello, World! from 5 of 8
…

Page 325: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Example : Communicator & Rank

• Compiling :

• Result :

22

mpicc -o hello2 hello2.c

Hello, World! from 4 of 8
Hello, World! from 3 of 8
Hello, World! from 1 of 8
Hello, World! from 0 of 8
Hello, World! from 5 of 8
Hello, World! from 6 of 8
Hello, World! from 7 of 8
Hello, World! from 2 of 8

Page 326: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

23

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 327: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI : Point to Point Communication

primitives

• The basic MPI communication mechanism between a pair of processes, in which one process sends data and the other receives it, is called “point to point communication”

• Message passing in an MPI program is carried out by 2 main MPI functions:
– MPI_Send – sends a message to a designated process

– MPI_Recv – receives a message from a process

• Each send and recv call carries additional information (the message envelope) along with the data to be exchanged between the application processes

• The message envelope consists of the following information– The rank of the receiver

– The rank of the sender

– A tag

– A communicator

• The source argument is used to distinguish messages received from different processes

• The tag is a user-specified int that can be used to distinguish messages from a single process (a minimal matched send/receive pair is sketched below)
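A minimal sketch (not part of the original slides), assuming at least two processes in MPI_COMM_WORLD: rank 0 sends one int to rank 1, and the envelopes (destination/source, tag, communicator) must agree for the message to be delivered.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value = 42, tag = 7;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* envelope: dest = 1, tag = 7, communicator = MPI_COMM_WORLD */
        MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* envelope must match: source = 0, tag = 7, same communicator */
        MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}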

24

Page 328: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Message Envelope

• Communication across processes is performed using messages.

• Each message carries a fixed set of fields used to distinguish it, called the message envelope:

– Envelope comprises source, destination, tag, communicator

– Message = Envelope + Data

• Communicator refers to the namespace associated with the group of related processes

25

[Figure: processes 0–7 in MPI_COMM_WORLD, with an example envelope — Source: process 0, Destination: process 1, Tag: 1234, Communicator: MPI_COMM_WORLD]

Page 329: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

26

MPI: (blocking) Send message

Function: MPI_Send()

int MPI_Send(

void *message,

int count,

MPI_Datatype datatype,

int dest,

int tag,

MPI_Comm comm )

Description: The contents of the message are stored in a block of memory referenced by the first parameter, message. The next two parameters, count and datatype, allow the system to determine how much storage is needed for the message: the message contains a sequence of count values, each of MPI type datatype. MPI allows a message to be received as long as sufficient storage has been allocated; if there isn't sufficient storage, an overflow error occurs. The dest parameter is the rank of the process to which the message is to be sent.

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Send.html

Page 330: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI : Data Types

MPI datatype C datatype

MPI_CHAR signed char

MPI_SHORT signed short int

MPI_INT signed int

MPI_LONG signed long int

MPI_UNSIGNED_CHAR unsigned char

MPI_UNSIGNED_SHORT unsigned short int

MPI_UNSIGNED unsigned int

MPI_UNSIGNED_LONG unsigned long int

MPI_FLOAT float

MPI_DOUBLE double

MPI_LONG_DOUBLE long double

MPI_BYTE

MPI_PACKED

27

You can also define your own (derived datatypes), such as an array of ints of size 100, or more complex examples, such as a struct or an array of structs

Page 331: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI: (blocking) Receive message

28

Function: MPI_Recv()

int MPI_Recv(

void *message,

int count,

MPI_Datatype datatype,

int source,

int tag,

MPI_Comm comm,

MPI_Status *status )

Description: The contents of the message are stored in a block of memory referenced by the first parameter, message. The next two parameters, count and datatype, allow the system to determine how much storage is needed for the message: the message contains a sequence of count values, each of MPI type datatype. MPI allows a message to be received as long as sufficient storage has been allocated; if there isn't sufficient storage, an overflow error occurs. The source parameter is the rank of the process from which the message is received. The MPI_Status parameter in the MPI_Recv() call returns information on the data that was actually received; it references a record with fields for the source and the tag.

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Recv.html

Page 332: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI_Status object

29

Object: MPI_Status

Example usage :MPI_Status status;

Description:The MPI_Status object is used by the receive functions to return data about the message, specifically the object contains the id of the process sending the message (MPI_SOURCE), the message tag (MPI_TAG), and error status (MPI_ERROR) .

#include "mpi.h"…

MPI_Status status; /* return status for */…MPI_Init(&argc, &argv);…if (my_rank != 0) {…

MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);}else { /* my rank == 0 */

for (source = 1; source < p; source++ ) {

MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);…MPI_Finalize();
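A small self-contained sketch (not part of the original slides) showing how the status fields can be inspected after a wildcard receive; MPI_ANY_SOURCE and MPI_ANY_TAG are standard MPI constants, and MPI_Get_count reports how many elements actually arrived.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, p, buf, count, source;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (rank != 0) {
        MPI_Send(&rank, 1, MPI_INT, 0, rank /* tag */, MPI_COMM_WORLD);
    } else {
        for (source = 1; source < p; source++) {
            /* accept from any sender with any tag, then inspect the envelope */
            MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_INT, &count);
            printf("got %d int(s) from rank %d, tag %d\n",
                   count, status.MPI_SOURCE, status.MPI_TAG);
        }
    }

    MPI_Finalize();
    return 0;
}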

Page 333: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI : Example send/recv

30

/* hello world, MPI style */

#include "mpi.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[]) {
    int my_rank;          /* rank of process            */
    int p;                /* number of processes        */
    int source;           /* rank of sender             */
    int dest;             /* rank of receiver           */
    int tag = 0;          /* tag for messages           */
    char message[100];    /* storage for message        */
    MPI_Status status;    /* return status for receive  */

    /* Start up MPI */
    MPI_Init(&argc, &argv);

    /* Find out process rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Find out number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (my_rank != 0) {
        /* Create message */
        sprintf(message, "Greetings from process %d!", my_rank);
        dest = 0;
        /* Use strlen+1 so that \0 gets transmitted */
        MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag,
                 MPI_COMM_WORLD);
    } else {   /* my_rank == 0 */
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag,
                     MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
        printf("Greetings from process %d!\n", my_rank);
    }

    /* Shut down MPI */
    MPI_Finalize();
}   /* end main */

Src : Prof. Amy Apon

Page 334: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Communication map for the example.

31

mpiexec -n 8 ./hello3
Greetings from process 1!
Greetings from process 2!
Greetings from process 3!
Greetings from process 4!
Greetings from process 5!
Greetings from process 6!
Greetings from process 7!
Greetings from process 0!
Writing logfile....
Finished writing logfile.
[cdekate@celeritas l7]$

Page 335: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

32

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 336: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point-to-point Communication

• How two processes interact

• Most flexible communication in MPI

• Two basic varieties

– Blocking and non-blocking

• Two basic functions

– Send and receive

• With these two functions, and the four functions we already know, you can do everything in MPI

– But there's probably a better way to do a lot of things, using other functions

33

Page 337: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point to Point Communication :

Basic concepts (buffered)

[Figure: buffered point-to-point path between Process 0 and Process 1, shown across user and kernel mode — Process 0 calls the send subroutine, data is copied from sendbuf to the system buffer (sysbuf) and sent over the network to the sysbuf at the receiving end; Process 1 calls the receive subroutine, which copies the data from its sysbuf into recvbuf (Steps 1–3)]

1. Data to be sent by the user is copied from the user memory space to the system buffer

2. The data is sent from the system buffer over the network to the system buffer of receiving process

3. The receiving process copies the data from system buffer to local user memory space

34

Page 338: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

MPI communication modes

• MPI offers several different types of communication modes, each having implications on data handling and performance:– Buffered

– Ready

– Standard

– Synchronous

• Each of these communication modes has both blocking and non-blocking primitives
– In blocking point to point communication the send call blocks until the send buffer can be reclaimed. Similarly, the receive call blocks until the receive buffer has successfully obtained the contents of the message.

– In the non-blocking point to point communication the send and receive calls allow the possible overlap of communication with computation. Communication is usually done in 2 phases: the posting phase and the test for completion phase.

• Synchronization Overhead: the time spent waiting for an event to occur on another task.

• System Overhead: the time spent copying the message data from the sender's message buffer to the network and from the network to the receiver's message buffer.

35

Page 339: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point to Point Communication

Blocking Synchronous Send

• The communication mode is selected while invoking the send routine.

• When a blocking synchronous send (MPI_Ssend()) is executed, a “ready to send” message is sent from the sending task to the receiving task.

• When the receive call (MPI_Recv()) is executed, a “ready to receive” message is sent back, followed by the transfer of data (a minimal sketch follows at the end of this slide).

• The sender process must wait for the receive to be executed and for the handshake to arrive before the message can be transferred. (Synchronization Overhead)

• The receiver process also has to wait for the handshake process to complete. (Synchronization Overhead)

• Overhead incurred while copying from sender & receiver buffers to the network.

36

http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html
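A minimal sketch (not part of the original slides) of selecting the synchronous mode; MPI_Ssend takes the same argument list as MPI_Send, and here rank 0 synchronously sends one int to rank 1 (at least two processes assumed).

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value = 1, tag = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* MPI_Ssend does not complete until the matching receive has started */
        MPI_Ssend(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
        printf("rank 1 got %d\n", value);
    }

    MPI_Finalize();
    return 0;
}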

Page 340: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point to Point Communication

Blocking Ready Send

• The ready mode send call (MPI_Rsend) sends the message over the network

once the “ready to receive” message is received.

• If the “ready to receive” message hasn't arrived, the ready mode send will raise an error and exit. The programmer is responsible for handling such errors and overriding the default behavior.

• The ready mode send call minimizes the system and synchronization overhead incurred on the sending side.

• The receive can still incur substantial synchronization overhead, depending on how much earlier the receive call is executed.

37

http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html

Page 341: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point to Point Communication

Blocking Buffered Send

• The blocking buffered send call (MPI_Bsend()) copies the data from the message buffer to a user-supplied buffer and then returns.

• The message buffer can then be reclaimed by the sending process without affecting any data that is sent.

• When the “ready to receive” notification is received, the data from the user-supplied buffer is sent to the receiver.

• The replicated copy of the buffer results in added system overhead.

• Synchronization overhead on the sender process is eliminated, as the sending process does not have to wait on the receive call.

• Synchronization overhead on the receiving process can still be incurred: if the receive is executed before the send, the receiving process must wait before it can return to its execution sequence (see the sketch below).

38

http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html
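A minimal sketch (not part of the original slides) of buffered mode; MPI_Bsend requires a user-supplied buffer attached with MPI_Buffer_attach, and MPI_BSEND_OVERHEAD accounts for the bookkeeping space MPI needs per message.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, value = 3, tag = 0, bufsize;
    char *buffer;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* attach a user-supplied buffer large enough for one int message */
        bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        buffer = (char *) malloc(bufsize);
        MPI_Buffer_attach(buffer, bufsize);

        /* returns as soon as the data has been copied into the attached buffer */
        MPI_Bsend(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);

        /* detach blocks until buffered messages have been transmitted */
        MPI_Buffer_detach(&buffer, &bufsize);
        free(buffer);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
        printf("rank 1 got %d\n", value);
    }

    MPI_Finalize();
    return 0;
}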

Page 342: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point to Point Communication

Blocking Standard Send

• The MPI_Send() operation is implementation dependent

• When the data size is smaller than a threshold value (varies for each implementation):

– The blocking standard send call (MPI_Send()) copies the message over the network

into the system buffer of the receiving node, after which the sending process continues

with the computation

– When the receive call (MPI_Recv()) is executed the message is copied from the

system buffer to the receiving task

– The decreased synchronization overhead is usually at the cost of increased system

overhead due to the extra copy of buffers

39

http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html

Page 343: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point to Point Communication

Buffered Standard Send

• When the message size is greater than the threshold:
– The behavior is the same as for the synchronous mode

– Small messages benefit from the decreased chance of synchronization overhead

– Large messages result in the increased cost of copying to the buffer and added system overhead

40

http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html

Page 344: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point to Point Communication

Non-blocking Calls

• The non-blocking send call (MPI_Isend()) posts a non-blocking standard send when the message buffer contents are ready to be transmitted

• Control returns immediately, without waiting for the copy to the remote system buffer to complete. MPI_Wait is called just before the sending task needs to overwrite the message buffer

• The programmer is responsible for checking the status of the message to know whether the data to be sent has been copied out of the send buffer

• The receiving call (MPI_Irecv()) posts a non-blocking receive as soon as a message buffer is ready to hold the message. The non-blocking receive returns without waiting for the message to arrive. The receiving task calls MPI_Wait when it needs to use the incoming message data (see the sketch below)

41

http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html
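A minimal sketch (not part of the original slides) of the post/compute/wait pattern described above, assuming ranks 0 and 1 exchange one int each while other work could overlap the transfers.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, partner, sendval, recvval = -1;
    MPI_Request sreq, rreq;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank < 2) {                    /* only ranks 0 and 1 take part in the exchange */
        partner = 1 - rank;
        sendval = rank;

        /* posting phase: both transfers are started, neither call blocks */
        MPI_Isend(&sendval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &sreq);
        MPI_Irecv(&recvval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &rreq);

        /* ... computation that does not touch sendval or recvval could go here ... */

        /* test-for-completion phase: wait before reusing the buffers */
        MPI_Wait(&sreq, &status);
        MPI_Wait(&rreq, &status);

        printf("rank %d received %d\n", rank, recvval);
    }

    MPI_Finalize();
    return 0;
}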

Page 345: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Point to Point Communication

Non-blocking Calls

• When the system buffer is full, a blocking send would have to wait until the receiving task pulled some message data out of the buffer. Use of non-blocking calls allows computation to be done during this interval, allowing for interleaving of computation and communication

• Properly posted non-blocking calls avoid this kind of deadlock

42

Page 346: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

43

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 347: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Deadlock

• Something to avoid

• A situation where the dependencies between processors are cyclic

– One processor is waiting for a message from another processor, but that processor is waiting for a message from the first, so nothing happens

• Until your time in the queue runs out and your job is killed

• MPI does not have timeouts

44

Page 348: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Deadlock Example

• If the message sizes are small enough, this should work because of system buffers

• If the messages are too large, or system buffering is not used, this will hang

if (rank == 0) {
    err = MPI_Send(sendbuf, count, datatype, 1, tag, comm);
    err = MPI_Recv(recvbuf, count, datatype, 1, tag, comm, &status);
} else {
    err = MPI_Send(sendbuf, count, datatype, 0, tag, comm);
    err = MPI_Recv(recvbuf, count, datatype, 0, tag, comm, &status);
}

45

Page 349: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Deadlock Example Solutions

if (rank == 0) {
    err = MPI_Send(sendbuf, count, datatype, 1, tag, comm);
    err = MPI_Recv(recvbuf, count, datatype, 1, tag, comm, &status);
} else {
    err = MPI_Recv(recvbuf, count, datatype, 0, tag, comm, &status);
    err = MPI_Send(sendbuf, count, datatype, 0, tag, comm);
}

or

if (rank == 0) {
    err = MPI_Isend(sendbuf, count, datatype, 1, tag, comm, &send_req);
    err = MPI_Irecv(recvbuf, count, datatype, 1, tag, comm, &recv_req);
    err = MPI_Wait(&send_req, &status);
    err = MPI_Wait(&recv_req, &status);
} else {
    err = MPI_Isend(sendbuf, count, datatype, 0, tag, comm, &send_req);
    err = MPI_Irecv(recvbuf, count, datatype, 0, tag, comm, &recv_req);
    err = MPI_Wait(&send_req, &status);
    err = MPI_Wait(&recv_req, &status);
}
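Another common way to express this exchange, not covered in these slides, is the combined MPI_Sendrecv call, which lets the MPI library schedule the send and the receive together so the call cannot deadlock against its matching partner; a minimal sketch using the same variable names as above:

/* each of the two ranks exchanges with its partner in a single call */
int partner = (rank == 0) ? 1 : 0;

err = MPI_Sendrecv(sendbuf, count, datatype, partner, tag,
                   recvbuf, count, datatype, partner, tag,
                   comm, &status);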

46

Page 350: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

47

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 351: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Numerical Integration Using Trapezoidal

Rule: A Case Study

• In review, the 6 main MPI calls:

– MPI_Init

– MPI_Finalize

– MPI_Comm_size

– MPI_Comm_rank

– MPI_Send

– MPI_Recv

• Using these 6 MPI function calls we can begin to

construct several kinds of parallel applications

• In the following section we discuss how to use these 6

calls to parallelize Trapezoidal Rule

48

Page 352: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Approximating Integrals: Definite Integral

• Problem: to find an approximate value of a definite integral

• A definite integral from a to b of a non-negative function f(x) can be thought of as the area bounded by the X-axis, the vertical lines x = a and x = b, and the graph of f(x)

49

Page 353: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Approximating Integrals : Trapezoidal Rule

• Approximating the area under the curve can be done by dividing the region under the curve into regular geometric shapes and then adding the areas of the shapes.

• In the Trapezoidal Rule, the region between a and b is divided into n trapezoids of base h = (b - a)/n

• The area of a trapezoid can be calculated as shown below

• In the case of our function, the area of the first block can be represented as shown below

• The area under the curve bounded by a and b can then be approximated as the sum shown below

50

Area of a trapezoid with parallel sides b1, b2 and base h:

  h (b1 + b2) / 2

Area of the first block:

  h ( f(a) + f(a+h) ) / 2

Approximation of the area under the curve between a and b (shown for n = 4 trapezoids):

  h ( f(a) + f(a+h) ) / 2  +  h ( f(a+h) + f(a+2h) ) / 2  +  h ( f(a+2h) + f(a+3h) ) / 2  +  h ( f(a+3h) + f(b) ) / 2

Page 354: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Approximating Integrals: Trapezoid Rule

• We can further generalize this concept of approximation

of integrals as a summation of trapezoidal areas

51

(h/2) [ f(x_0) + f(x_1) ]  +  (h/2) [ f(x_1) + f(x_2) ]  +  ...  +  (h/2) [ f(x_{n-1}) + f(x_n) ]

  =  (h/2) [ f(x_0) + f(x_1) + f(x_1) + f(x_2) + ... + f(x_{n-1}) + f(x_n) ]

  =  (h/2) [ f(x_0) + 2 f(x_1) + 2 f(x_2) + ... + 2 f(x_{n-1}) + f(x_n) ]

  =  h [ f(x_0)/2 + f(x_1) + f(x_2) + ... + f(x_{n-1}) + f(x_n)/2 ]

Page 355: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Trapezoidal Rule – Serial / Sequential

program in C

52

/* serial.c -- serial trapezoidal rule
 *
 * Calculate definite integral using trapezoidal rule.
 * The function f(x) is hardwired.
 * Input: a, b, n.
 * Output: estimate of integral from a to b of f(x) using n trapezoids.
 *
 * See Chapter 4, pp. 53 & ff. in PPMPI.
 */

#include <stdio.h>

main() {
    float integral;    /* Store result in integral  */
    float a, b;        /* Left and right endpoints  */
    int   n;           /* Number of trapezoids      */
    float h;           /* Trapezoid base width      */
    float x;
    int   i;
    float f(float x);  /* Function we're integrating */

    printf("Enter a, b, and n\n");
    scanf("%f %f %d", &a, &b, &n);

    h = (b-a)/n;
    integral = (f(a) + f(b))/2.0;
    x = a;
    for (i = 1; i <= n-1; i++) {
        x = x + h;
        integral = integral + f(x);
    }
    integral = integral*h;

    printf("With n = %d trapezoids, our estimate\n", n);
    printf("of the integral from %f to %f = %f\n", a, b, integral);
}  /* main */

float f(float x) {
    float return_val;
    /* Calculate f(x). Store calculation in return_val. */
    return_val = x*x;
    return return_val;
}  /* f */

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4

Page 356: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Results for the Serial Trapezoidal Rule

a b n f(x) single precision f(x) double precision

2 25 1 7233.500000 7233.500000

2 25 2 5712.625000 5712.625000

2 25 10 5225.945312 5225.945000

2 25 30 5207.916992 5207.919815

2 25 40 5206.934082 5206.934062

2 25 50 5206.475098 5206.477800

2 25 1000 5205.664551 5205.668694

53

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4


Page 357: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Parallelizing Trapezoidal Rule

• One way of parallelizing the Trapezoidal Rule:
– Distribute chunks of the workload to each process (each chunk characterized by its own subinterval of [a, b])

– Calculate the local integral for each subinterval

– Finally, add the local integrals from all the subintervals to produce the result for the complete interval [a, b]

• Issues to consider:
– The number of trapezoids (n) must be equally divisible across the (p) processes (load balancing)

– The first process calculates the area for the first n/p trapezoids, the second process calculates the area for the next n/p trapezoids, and so on

• Key information related to the problem that each process needs:
– The rank of the process

– The ability to derive its workload as a function of rank

Assumption : Process 0 does the summation

54


Page 358: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Parallelizing Trapezoidal Rule

• Algorithm (assumption: the number of trapezoids n is evenly divisible by the number of processors p)

– Calculate the trapezoid base width: h = (b - a) / n

– Each process calculates its own workload (interval to integrate)

• local number of trapezoids ( local_n) = n/p

• local starting point (local_a) = a+(process_rank *local_n* h)

• local ending point (local_b) = (local_a + local_n * h)

– Each process calculates its own integral for the local intervals

• For each of the local_n trapezoids calculate area

• Aggregate area for local_n trapezoids

– If PROCESS_RANK == 0

• Receive messages (containing sub-interval area aggregates) from all processors

• Aggregate (ADD) all sub-interval areas

– If PROCESS_RANK > 0

• Send sub-interval area to PROCESS_RANK(0)

Classic SPMD: all processes run the same program on different datasets.

55


Page 359: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Parallel Trapezoidal Rule

56

#include <stdio.h>
#include "mpi.h"

main(int argc, char** argv) {
    int   my_rank;        /* My process rank                         */
    int   p;              /* The number of processes                 */
    float a = 0.0;        /* Left endpoint                           */
    float b = 1.0;        /* Right endpoint                          */
    int   n = 1024;       /* Number of trapezoids                    */
    float h;              /* Trapezoid base length                   */
    float local_a;        /* Left endpoint my process                */
    float local_b;        /* Right endpoint my process               */
    int   local_n;        /* Number of trapezoids for my calculation */
    float integral;       /* Integral over my interval               */
    float total;          /* Total integral                          */
    int   source;         /* Process sending integral                */
    int   dest = 0;       /* All messages go to 0                    */
    int   tag = 0;
    MPI_Status status;

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4

Page 360: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Parallel Trapezoidal Rule

57

    float Trap(float local_a, float local_b, int local_n, float h);  /* Calculate local integral */

    /* Let the system do what it needs to start up MPI */
    MPI_Init(&argc, &argv);

    /* Get my process rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Find out how many processes are being used */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    h = (b-a)/n;      /* h is the same for all processes */
    local_n = n/p;    /* So is the number of trapezoids  */

    /* Length of each process' interval of
     * integration = local_n*h. So my interval
     * starts at: */
    local_a = a + my_rank*local_n*h;
    local_b = local_a + local_n*h;
    integral = Trap(local_a, local_b, local_n, h);

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4

Page 361: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Parallel Trapezoidal Rule

58

    /* Add up the integrals calculated by each process */
    if (my_rank == 0) {
        total = integral;
        for (source = 1; source < p; source++) {
            MPI_Recv(&integral, 1, MPI_FLOAT, source, tag,
                     MPI_COMM_WORLD, &status);
            total = total + integral;
        }
    } else {
        MPI_Send(&integral, 1, MPI_FLOAT, dest,
                 tag, MPI_COMM_WORLD);
    }

    /* Print the result */
    if (my_rank == 0) {
        printf("With n = %d trapezoids, our estimate\n", n);
        printf("of the integral from %f to %f = %f\n", a, b, total);
    }

    /* Shut down MPI */
    MPI_Finalize();
}  /* main */

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4

Page 362: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Parallel Trapezoidal Rule

59

float Trap(float local_a  /* in */,
           float local_b  /* in */,
           int   local_n  /* in */,
           float h        /* in */) {

    float integral;      /* Store result in integral */
    float x;
    int   i;
    float f(float x);    /* function we're integrating */

    integral = (f(local_a) + f(local_b))/2.0;
    x = local_a;
    for (i = 1; i <= local_n-1; i++) {
        x = x + h;
        integral = integral + f(x);
    }
    integral = integral*h;
    return integral;
}  /* Trap */

float f(float x) {
    float return_val;
    /* Calculate f(x). Store calculation in return_val. */
    return_val = x*x;
    return return_val;
}  /* f */

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4

Page 363: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Parallel Trapezoidal Rule

60

[cdekate@celeritas l7]$ mpiexec -n 8 … trap
With n = 1024 trapezoids, our estimate
of the integral from 2.000000 to 25.000000 = 5205.667969
Writing logfile....
Finished writing logfile.

[cdekate@celeritas l7]$ ./serial
Enter a, b, and n
2 25 1024
With n = 1024 trapezoids, our estimate
of the integral from 2.000000 to 25.000000 = 5205.666016
[cdekate@celeritas l7]$

Page 364: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

61

Topics

• Introduction

• MPI Standard

• MPI-1 Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 365: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Profiling Applications

• To profile your parallel applications (a consolidated command sketch follows these steps):
1. Compile the application with

mpicc -profile=mpe_mpilog -o trap trap.c

2. Run the application using the standard PBS/mpirun procedure

3. After your run is complete you should see lines like these in your stdout (the standard-output file of your PBS-based run):
Writing logfile....
Finished writing logfile.

4. You will also see a file with the extension “clog2”

5. I.e., if your executable was named “parallel_program”, you would see a file named “parallel_program.clog2”

6. Convert the “clog2” file to “slog2” format by issuing the command
clog2TOslog2 parallel_program.clog2
(maintain the capitalization in the clog2TOslog2 command)

7. Step 6 will result in a parallel_program.slog2 file

8. Use Jumpshot to visualize this file
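Putting the steps together as a single command sketch (the executable name parallel_program is illustrative, and on a production system the run in steps 2–3 would normally be launched from a PBS script like the ones shown earlier):

# 1. compile with the MPE logging wrapper
mpicc -profile=mpe_mpilog -o parallel_program parallel_program.c

# 2-3. run as usual (interactively here, or via PBS/mpirun); the run writes parallel_program.clog2
mpiexec -n 8 ./parallel_program

# 6. convert the clog2 file to slog2 format
clog2TOslog2 parallel_program.clog2

# 8. visualize the resulting parallel_program.slog2 with Jumpshot
java -jar jumpshot.jar parallel_program.slog2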

62

Page 366: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Using Jumpshot

Note : You need Java Runtime Environment on your

machine in order to be able to run Jumpshot

Download your parallel_program.slog2 file from Arete

• Download Jumpshot from :

– ftp://ftp.mcs.anl.gov/pub/mpi/slog2/slog2rte.tar.gz

– Uncompress the tar.gz file to get a folder: slog2rte-1.2.6/

– In slog2rte-1.2.6/lib/, type: java -jar jumpshot.jar parallel_program.slog2

• Or click on jumpshot_launcher.jar

• Open the file using Jumpshot's File menu

63

Page 367: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

64

Topics

• Introduction

• MPI Standard

• MPI-1.x Model and Basic Calls

• MPI Communicators

• Point to Point Communication

• Point to Point Communication in-depth

• Deadlock

• Trapezoidal Rule : A Case Study

• Using MPI & Jumpshot to profile parallel applications

• Summary – Materials for the Test

Page 368: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

Summary : Material for the Test

• Basic MPI – 10, 11, 12

• Communicators – 18, 19, 20

• Point to Point Communication – 24, 25, 26, 27, 28

• In-depth Point to Point Communication – 33, 34, 35, 36,

37, 38, 39, 40, 41, 42

• Deadlock – 44, 45, 46

65

Page 369: Paralel Computing

CSC 7600 Lecture 7 : MPI1 Spring 2011

66

Page 370: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &

MEANS

MESSAGE PASSING INTERFACE MPI

(PART B)

Prof. Thomas SterlingDepartment of Computer ScienceLouisiana State UniversityFebruary 10, 2011

Page 371: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 20112

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 372: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Review of Basic MPI Calls

• In review, the 6 main MPI calls:

– MPI_Init

– MPI_Finalize

– MPI_Comm_size

– MPI_Comm_rank

– MPI_Send

– MPI_Recv

• Include MPI Header file

– #include “mpi.h”

• Basic MPI Datatypes

– MPI_INT, MPI_FLOAT, ….

3

Page 373: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 20114

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 374: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Collective Calls

• A communication pattern that encompasses all processes within a communicator is known as collective communication

• MPI has several collective communication calls; the most frequently used are:
– Synchronization

• Barrier

– Communication

• Broadcast

• Gather & Scatter

• All Gather

– Reduction

• Reduce

• AllReduce

5

Page 375: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 20116

MPI Collective Calls : Barrier

Function: MPI_Barrier()

int MPI_Barrier (

MPI_Comm comm )

Description: Creates barrier synchronization in a communicator group comm. Each process, when reaching the MPI_Barrier call, blocks until all the processes in the group reach the same MPI_Barrier call.

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Barrier.html

[Figure: processes P0–P3 each blocking at MPI_Barrier() until all four have reached it]

Page 376: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example: MPI_Barrier()

7

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);

    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    MPI_Barrier(MPI_COMM_WORLD);

    printf("Hello world! Process %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

[cdekate@celeritas collective]$ mpirun -np 8 barrier
Hello world! Process 0 of 8 on celeritas.cct.lsu.edu
Writing logfile....
Finished writing logfile.
Hello world! Process 4 of 8 on compute-0-3.local
Hello world! Process 1 of 8 on compute-0-0.local
Hello world! Process 3 of 8 on compute-0-2.local
Hello world! Process 6 of 8 on compute-0-5.local
Hello world! Process 7 of 8 on compute-0-6.local
Hello world! Process 5 of 8 on compute-0-4.local
Hello world! Process 2 of 8 on compute-0-1.local
[cdekate@celeritas collective]$

Page 377: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 20118

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 378: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 20119

MPI Collective Calls : Broadcast

Function: MPI_Bcast()

int MPI_Bcast (

void *message,

int count,

MPI_Datatype datatype,

int root,

MPI_Comm comm )

Description: A collective communication call where a single process sends the same data contained in the message to every process in the communicator. By default a tree-like algorithm is used to broadcast the message to a block of processors; a linear algorithm is then used to broadcast the message from the first process in a block to all other processes. All the processes invoke the MPI_Bcast call with the same arguments for root and comm.

float endpoint[2];
...
MPI_Bcast(endpoint, 2, MPI_FLOAT, 0, MPI_COMM_WORLD);
...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Bcast.html

[Figure: Broadcast — data item A on P0 is replicated so that P0, P1, P2, and P3 all hold A]

Page 379: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201110

MPI Collective Calls : Scatter

Function: MPI_Scatter()

int MPI_Scatter (

void *sendbuf,

int send_count,

MPI_Datatype send_type,

void *recvbuf,

int recv_count,

MPI_Datatype recv_type,

int root,

MPI_Comm comm)

Description: MPI_Scatter splits the data referenced by sendbuf on the process with rank root into p segments, each of which consists of send_count elements of type send_type. The first segment is sent to process 0, the second segment to process 1, and so on. The send arguments are significant only on the process with rank root.

...
MPI_Scatter(&(local_A[0][0]), n/p, MPI_FLOAT, row_segment, n/p, MPI_FLOAT, 0, MPI_COMM_WORLD);
...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Scatter.html

[Figure: Scatter — local_A[][] = (A, B, C, D) on P0 is split so that P0 receives A, P1 receives B, P2 receives C, and P3 receives D in row_segment]
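A minimal sketch (not part of the original slides) that scatters one int to each process; the array and variable names are illustrative.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, p, my_value, i;
    int *senddata = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (rank == 0) {
        /* only the root needs the full send buffer */
        senddata = (int *) malloc(p * sizeof(int));
        for (i = 0; i < p; i++)
            senddata[i] = 100 + i;
    }

    /* each process (including the root) receives one int */
    MPI_Scatter(senddata, 1, MPI_INT, &my_value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d received %d\n", rank, my_value);

    if (rank == 0)
        free(senddata);
    MPI_Finalize();
    return 0;
}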

Page 380: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201111

MPI Collective Calls : Gather

Function: MPI_Gather()

int MPI_Gather (

void *sendbuf,

int send_count,

MPI_Datatype sendtype,

void *recvbuf,

int recvcount,

MPI_Datatype recvtype,

int root,

MPI_Comm comm )

Description: MPI_Gather collects the data referenced by sendbuf from each process in the communicator comm, and stores the data in process-rank order on the process with rank root, in the location referenced by recvbuf. The recv parameters are significant only on the process with rank root.

...
MPI_Gather(local_x, n/p, MPI_FLOAT, global_x, n/p, MPI_FLOAT, 0, MPI_COMM_WORLD);
...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Gather.html

[Figure: Gather — local_x values A, B, C, D held by P0–P3 are collected into global_x = (A, B, C, D) on P0]

Page 381: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201112

MPI Collective Calls : All Gather

Function: MPI_Allgather()

int MPI_Allgather (

void *sendbuf,

int send_count,

MPI_Datatype sendtype,

void *recvbuf,

int recvcount,

MPI_Datatype recvtype,

MPI_Comm comm )

Description: MPI_Allgather gathers the content of the send buffer (sendbuf) from each process. The effect of this call is similar to executing MPI_Gather() p times, with a different process acting as the root each time:

for (root = 0; root < p; root++)
    MPI_Gather(local_x, n/p, MPI_FLOAT, global_x, n/p, MPI_FLOAT, root, MPI_COMM_WORLD);
...

CAN BE REPLACED WITH:

MPI_Allgather(local_x, local_n, MPI_FLOAT, global_x, local_n, MPI_FLOAT, MPI_COMM_WORLD);

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Allgather.html

[Figure: All Gather — A, B, C, D contributed by P0–P3; every process ends up with the full sequence (A, B, C, D)]
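A minimal sketch (not part of the original slides): every rank contributes one int and every rank receives the full array.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, p, i, my_value;
    int *all_values;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    my_value = rank * rank;                       /* this rank's contribution      */
    all_values = (int *) malloc(p * sizeof(int)); /* every rank gets the full result */

    MPI_Allgather(&my_value, 1, MPI_INT, all_values, 1, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0)
        for (i = 0; i < p; i++)
            printf("all_values[%d] = %d\n", i, all_values[i]);

    free(all_values);
    MPI_Finalize();
    return 0;
}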

Page 382: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201113

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 383: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201114

MPI Collective Calls : ReduceFunction: MPI_Reduce()

int MPI_Reduce (

void *operand,

void *result,

int count,

MPI_Datatype datatype,

MPI_Op operator,

int root,

MPI_Comm comm )

Description: A collective communication call where all the processes in a communicator contribute data that is combined using a binary operation (MPI_Op) such as addition, max, min, logical and, etc. MPI_Reduce combines the operands stored in the memory referenced by operand using the operation operator and stores the result in *result on the process with rank root. MPI_Reduce is called by all the processes in the communicator comm, and count, datatype, operator and root must be the same on each of them.

...
MPI_Reduce(&local_integral, &integral, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Reduce.html

[Figure: Reduce with binary op MPI_SUM — A, B, C, D held by P0–P3 are combined into A+B+C+D on P0]
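A minimal sketch (not part of the original slides): each rank contributes its rank value and rank 0 receives the sum.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, p, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* combine every rank's value with MPI_SUM; the result lands on rank 0 */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", p - 1, sum);   /* equals p*(p-1)/2 */

    MPI_Finalize();
    return 0;
}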

Page 384: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

MPI Binary Operations

15

• MPI binary operators are used in the MPI_Reduce function call as one of the parameters. MPI_Reduce performs a global reduction operation (dictated by the MPI binary operator parameter) on the supplied operands.

• Some of the common MPI binary operators are:

Operation Name   Meaning
MPI_MAX          Maximum
MPI_MIN          Minimum
MPI_SUM          Sum
MPI_PROD         Product
MPI_LAND         Logical And
MPI_BAND         Bitwise And
MPI_LOR          Logical Or
MPI_BOR          Bitwise Or
MPI_LXOR         Logical XOR
MPI_BXOR         Bitwise XOR
MPI_MAXLOC       Maximum and location of max.
MPI_MINLOC       Minimum and location of min.

MPI_Reduce(&local_integral,

&integral, 1, MPI_FLOAT,

MPI_SUM, 0, MPI_COMM_WORLD);

Page 385: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201116

MPI Collective Calls : All Reduce

Function: MPI_Allreduce()

int MPI_Allreduce (

void *sendbuf,

void *recvbuf,

int count,

MPI_Datatype datatype,

MPI_Op op,

MPI_Comm comm )

Description: MPI_Allreduce is used exactly like MPI_Reduce, except that the result of the reduction is returned on all processes; as a result there is no root parameter.

...
MPI_Allreduce(&local_integral, &integral, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Allreduce.html

[Figure: All Reduce with binary op MPI_SUM — A, B, C, D held by P0–P3; every process receives A+B+C+D]

Page 386: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Parallel Trapezoidal Rule

Send, Recv

17

#include <stdio.h>
#include "mpi.h"

main(int argc, char** argv) {
    int   my_rank;        /* My process rank                         */
    int   p;              /* The number of processes                 */
    float a = 0.0;        /* Left endpoint                           */
    float b = 1.0;        /* Right endpoint                          */
    int   n = 1024;       /* Number of trapezoids                    */
    float h;              /* Trapezoid base length                   */
    float local_a;        /* Left endpoint my process                */
    float local_b;        /* Right endpoint my process               */
    int   local_n;        /* Number of trapezoids for my calculation */
    float integral;       /* Integral over my interval               */
    float total;          /* Total integral                          */
    int   source;         /* Process sending integral                */
    int   dest = 0;       /* All messages go to 0                    */
    int   tag = 0;
    MPI_Status status;

    float Trap(float local_a, float local_b, int local_n, float h);  /* Calculate local integral */

    /* Let the system do what it needs to start up MPI */
    MPI_Init(&argc, &argv);

    /* Get my process rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Find out how many processes are being used */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    h = (b-a)/n;      /* h is the same for all processes */
    local_n = n/p;    /* So is the number of trapezoids  */

    /* Length of each process' interval of
     * integration = local_n*h. So my interval
     * starts at: */
    local_a = a + my_rank*local_n*h;
    local_b = local_a + local_n*h;
    integral = Trap(local_a, local_b, local_n, h);

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4

Page 387: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Parallel Trapezoidal Rule

Send, Recv

18

    if (my_rank == 0) {
        total = integral;
        for (source = 1; source < p; source++) {
            MPI_Recv(&integral, 1, MPI_FLOAT, source, tag,
                     MPI_COMM_WORLD, &status);
            total = total + integral;
        }
    } else {
        MPI_Send(&integral, 1, MPI_FLOAT, dest,
                 tag, MPI_COMM_WORLD);
    }

    if (my_rank == 0) {
        printf("With n = %d trapezoids, our estimate\n", n);
        printf("of the integral from %f to %f = %f\n", a, b, total);
    }

    MPI_Finalize();
}  /* main */

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4

Page 388: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Parallel Trapezoidal Rule

Send, Recv

19

float Trap(float local_a  /* in */,
           float local_b  /* in */,
           int   local_n  /* in */,
           float h        /* in */) {

    float integral;      /* Store result in integral */
    float x;
    int   i;
    float f(float x);    /* function we're integrating */

    integral = (f(local_a) + f(local_b))/2.0;
    x = local_a;
    for (i = 1; i <= local_n-1; i++) {
        x = x + h;
        integral = integral + f(x);
    }
    integral = integral*h;
    return integral;
}  /* Trap */

float f(float x) {
    float return_val;
    /* Calculate f(x). Store calculation in return_val. */
    return_val = x*x;
    return return_val;
}  /* f */

Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4

Page 389: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201120

Flowchart for Parallel Trapezoidal Rule

[Flowchart: the MASTER and each WORKER initialize the MPI environment, create local workload buffers (variables etc.), isolate their work regions, and calculate the sequential trapezoid rule for their local region; each worker calculates its local integral and sends the result to the “master”, while the master receives the results from the “workers”, integrates the results for the local workload, concatenates the results to file, and ends]

Page 390: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Trapezoidal Rule :

with MPI_Bcast, MPI_Reduce

21

#include <stdio.h>
#include <stdlib.h>

/* We'll be using MPI routines, definitions, etc. */
#include "mpi.h"

main(int argc, char** argv) {
    int   my_rank;        /* My process rank                         */
    int   p;              /* The number of processes                 */
    float endpoint[2];    /* Left and right endpoints                */
    int   n = 1024;       /* Number of trapezoids                    */
    float h;              /* Trapezoid base length                   */
    float local_a;        /* Left endpoint my process                */
    float local_b;        /* Right endpoint my process               */
    int   local_n;        /* Number of trapezoids for my calculation */
    float integral;       /* Integral over my interval               */
    float total;          /* Total integral                          */
    int   source;         /* Process sending integral                */
    int   dest = 0;       /* All messages go to 0                    */
    int   tag = 0;
    MPI_Status status;

    float Trap(float local_a, float local_b, int local_n, float h);  /* Calculate local integral */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (argc != 3) {
        if (my_rank == 0)
            printf("Usage: mpirun -np <numprocs> trapezoid <left> <right>\n");
        MPI_Finalize();
        exit(0);
    }

    if (my_rank == 0) {
        endpoint[0] = atof(argv[1]);   /* left endpoint  */
        endpoint[1] = atof(argv[2]);   /* right endpoint */
    }

    MPI_Bcast(endpoint, 2, MPI_FLOAT, 0, MPI_COMM_WORLD);

Page 391: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Trapezoidal Rule :

with MPI_Bcast, MPI_Reduce

22

    h = (endpoint[1]-endpoint[0])/n;   /* h is the same for all processes */
    local_n = n/p;                     /* so is the number of trapezoids  */
    if (my_rank == 0)
        printf("a=%f, b=%f, Local number of trapezoids=%d\n",
               endpoint[0], endpoint[1], local_n);

    local_a = endpoint[0] + my_rank*local_n*h;
    local_b = local_a + local_n*h;
    integral = Trap(local_a, local_b, local_n, h);

    MPI_Reduce(&integral, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (my_rank == 0) {
        printf("With n = %d trapezoids, our estimate\n", n);
        printf("of the integral from %f to %f = %f\n",
               endpoint[0], endpoint[1], total);
    }

    MPI_Finalize();
}  /* main */

float Trap(float local_a  /* in */,
           float local_b  /* in */,
           int   local_n  /* in */,
           float h        /* in */) {

    float integral;      /* Store result in integral */
    float x;
    int   i;
    float f(float x);    /* function we're integrating */

    integral = (f(local_a) + f(local_b))/2.0;
    x = local_a;
    for (i = 1; i <= local_n-1; i++) {
        x = x + h;
        integral = integral + f(x);
    }
    integral = integral*h;
    return integral;
}  /* Trap */

float f(float x) {
    float return_val;
    /* Calculate f(x). Store calculation in return_val. */
    return_val = x*x;
    return return_val;
}  /* f */

Page 392: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Trapezoidal Rule :

with MPI_Bcast, MPI_Reduce

23

#!/bin/bash

#PBS -N name

#PBS -l walltime=120:00:00,nodes=2:ppn=4

cd /home/lsu00/Demos/l9/trapBcast

pwd

date

PROCS=`wc -l < $PBS_NODEFILE`

mpdboot --file=$PBS_NODEFILE

/usr/lib64/mpich2/bin/mpiexec -n $PROCS ./trapBcast 2 25 >>out.txt

mpdallexit

date

Page 393: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201124

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 394: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Constructing Datatypes

• Creating data structures in C:

typedef struct {
    . . .
} STRUCT_NAME;

• For example, in the numerical integration by trapezoidal rule we could create a data structure for storing the attributes of the problem as follows:

typedef struct {
    float a;
    float b;
    int   n;
} DATA_INTEGRAL;
. . .
DATA_INTEGRAL intg_data;

• What would happen if you used:

MPI_Bcast( &intg_data, 1, DATA_INTEGRAL, 0, MPI_COMM_WORLD);

25

ERROR!!! intg_data is of type DATA_INTEGRAL, which is NOT an MPI_Datatype (see the sketch below for one way to build a matching MPI datatype)
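A minimal sketch (not part of the original slides) of how a matching MPI datatype for DATA_INTEGRAL could be built with MPI_Type_struct, following the same pattern as the struct example later in this lecture; the displacement arithmetic assumes the two floats are laid out contiguously before the int, as the MPI_Type_extent-based offsets in that example do.

MPI_Datatype mpi_data_integral;
MPI_Datatype oldtypes[2]    = { MPI_FLOAT, MPI_INT };
int          blockcounts[2] = { 2, 1 };        /* two floats (a, b), then one int (n) */
MPI_Aint     offsets[2], extent;

offsets[0] = 0;
MPI_Type_extent(MPI_FLOAT, &extent);
offsets[1] = 2 * extent;                       /* n starts after the two floats */

MPI_Type_struct(2, blockcounts, offsets, oldtypes, &mpi_data_integral);
MPI_Type_commit(&mpi_data_integral);

/* now the whole struct can be broadcast in one call */
MPI_Bcast(&intg_data, 1, mpi_data_integral, 0, MPI_COMM_WORLD);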

Page 395: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Constructing MPI Datatypes

• MPI allows users to define derived MPI datatypes, built from the basic datatypes at execution time

• These derived datatypes can be used in MPI communication calls instead of the basic predefined datatypes.

• A sending process can pack noncontiguous data into a contiguous buffer and send the buffered data to a receiving process, which can unpack the contiguous buffer and store the data to noncontiguous locations.

• A derived datatype is an opaque object that specifies :– A sequence of primitive datatypes

– A sequence of integer (byte) displacements

• MPI has several functions for constructing derived datatypes :– Contiguous

– Vector

– Indexed

– Struct

26

Page 396: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

MPI : Basic Data Types

(Review)

MPI datatype C datatype

MPI_CHAR signed char

MPI_SHORT signed short int

MPI_INT signed int

MPI_LONG signed long int

MPI_UNSIGNED_CHAR unsigned char

MPI_UNSIGNED_SHORT unsigned short int

MPI_UNSIGNED unsigned int

MPI_UNSIGNED_LONG unsigned long int

MPI_FLOAT float

MPI_DOUBLE double

MPI_LONG_DOUBLE long double

MPI_BYTE

MPI_PACKED

27

You can also define your own (derived datatypes), such as an array of ints of size 100, or more complex examples, such as a struct or an array of structs

Page 397: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201128

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 398: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201129

Derived Datatypes : Contiguous

Function: MPI_Type_contiguous()

int MPI_Type_contiguous(

int count,

MPI_Datatype old_type,

MPI_Datatype *new_type)

Description: This is the simplest of the MPI derived-datatype constructors. The contiguous datatype constructor creates a new datatype by making count copies of an existing datatype (old_type).

MPI_Datatype rowtype;
...
MPI_Type_contiguous(SIZE, MPI_FLOAT, &rowtype);
MPI_Type_commit(&rowtype);
...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Type_contiguous.html

Page 399: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example : Derived Datatypes - Contiguous

30

#include "mpi.h"#include <stdio.h>

#define SIZE 4

int main(argc,argv)int argc;char *argv[]; {int numtasks, rank, source=0, dest, tag=1, i;MPI_Request req;

float a[SIZE][SIZE] ={1.0, 2.0, 3.0, 4.0,5.0, 6.0, 7.0, 8.0,9.0, 10.0, 11.0, 12.0,13.0, 14.0, 15.0, 16.0};

float b[SIZE];

MPI_Status stat;

MPI_Datatype rowtype;

MPI_Init(&argc,&argv);MPI_Comm_rank(MPI_COMM_WORLD, &rank);MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

MPI_Type_contiguous(SIZE, MPI_FLOAT, &rowtype);MPI_Type_commit(&rowtype);if (numtasks == SIZE) {if (rank == 0) {

for (i=0; i<numtasks; i++){dest = i;

MPI_Isend(&a[i][0], 1, rowtype, dest, tag, MPI_COMM_WORLD, &req);

}}

MPI_Recv(b, SIZE, MPI_FLOAT, source, tag, MPI_COMM_WORLD, &stat);printf(“rank= %d b= %3.1f %3.1f %3.1f %3.1f\n”,

rank,b[0],b[1],b[2],b[3]);}

elseprintf(“Must specify %d processors. Terminating.\n”,SIZE);

MPI_Type_free(&rowtype);MPI_Finalize();}

Declares a 4x4 array of datatype float; each row (1.0 2.0 3.0 4.0 / 5.0 6.0 7.0 8.0 / 9.0 10.0 11.0 12.0 / 13.0 14.0 15.0 16.0) is sent as one homogeneous data structure of size 4 (type: rowtype).

https://computing.llnl.gov/tutorials/mpi/

Page 400: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example : Derived Datatypes - Contiguous

31

https://computing.llnl.gov/tutorials/mpi/

Page 401: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201132

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 402: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201133

Derived Datatypes : Vector

Function: MPI_Type_vector()

int MPI_Type_vector(

int count,

int blocklen,

int stride,

MPI_Datatype old_type,

MPI_Datatype *newtype )

Description:

Returns a new datatype that represents equally spaced blocks. The spacing between the start of each block is given in units of extent (oldtype). The count represents the number of blocks, blocklen details the number of elements in each block, stride represents the number of elements between start of each block of the old_type. The new datatype is stored in new_type

...

MPI_Type_vector(SIZE, 1, SIZE, MPI_FLOAT, &columntype);...

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Type_vector.html

Page 403: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example : Derived Datatypes - Vector

34

#include "mpi.h"#include <stdio.h>#define SIZE 4

int main(argc,argv)int argc;char *argv[]; {int numtasks, rank, source=0, dest, tag=1, i;MPI_Request req;float a[SIZE][SIZE] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0,

13.0, 14.0, 15.0, 16.0};float b[SIZE];

MPI_Status stat;MPI_Datatype columntype;

MPI_Init(&argc,&argv);MPI_Comm_rank(MPI_COMM_WORLD, &rank);MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

MPI_Type_vector(SIZE, 1, SIZE, MPI_FLOAT, &columntype);MPI_Type_commit(&columntype);

if (numtasks == SIZE) {if (rank == 0) {

for (i=0; i<numtasks; i++)

MPI_Isend(&a[0][i], 1, columntype, i, tag, MPI_COMM_WORLD, &req);

}

MPI_Recv(b, SIZE, MPI_FLOAT, source, tag, MPI_COMM_WORLD, &stat);printf("rank= %d b= %3.1f %3.1f %3.1f %3.1f\n",

rank,b[0],b[1],b[2],b[3]);}

elseprintf("Must specify %d processors. Terminating.\n",SIZE);

MPI_Type_free(&columntype);MPI_Finalize();}

https://computing.llnl.gov/tutorials/mpi/

Declares a 4x4 array of datatype float; each column (1.0 5.0 9.0 13.0 / 2.0 6.0 10.0 14.0 / 3.0 7.0 11.0 15.0 / 4.0 8.0 12.0 16.0) is sent as one homogeneous data structure of size 4 (type: columntype).

Page 404: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example : Derived Datatypes - Vector

35

https://computing.llnl.gov/tutorials/mpi/

Page 405: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201136

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 406: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201137

Derived Datatypes : Indexed

Function: MPI_Type_indexed()

int MPI_Type_indexed(

int count,

int *array_of_blocklengths,

int *array_of_displacements,

MPI_Datatype oldtype,

MPI_datatype *newtype);

Description: Returns a new datatype that represents count blocks. Each block is defined by an entry in array_of_blocklengths and array_of_displacements. Displacements are expressed in units of extent(oldtype). count is both the number of blocks and the number of entries in array_of_displacements (the displacement of each block in units of oldtype) and array_of_blocklengths (the number of instances of oldtype in each block).

...
MPI_Type_indexed(2, blocklengths, displacements, MPI_FLOAT, &indextype);
...

https://computing.llnl.gov/tutorials/mpi/man/MPI_Type_indexed.txt

Page 407: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example : Derived Datatypes - Indexed

38

#include "mpi.h"
#include <stdio.h>
#define NELEMENTS 6

int main(int argc, char *argv[])
{
   int numtasks, rank, source=0, dest, tag=1, i;
   MPI_Request req;
   int blocklengths[2], displacements[2];
   float a[16] = { 1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
                   9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0};
   float b[NELEMENTS];
   MPI_Status stat;
   MPI_Datatype indextype;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

   blocklengths[0] = 4;
   blocklengths[1] = 2;
   displacements[0] = 5;
   displacements[1] = 12;

   MPI_Type_indexed(2, blocklengths, displacements, MPI_FLOAT, &indextype);
   MPI_Type_commit(&indextype);

   if (rank == 0) {
      for (i=0; i<numtasks; i++)
         MPI_Isend(a, 1, indextype, i, tag, MPI_COMM_WORLD, &req);
   }

   MPI_Recv(b, NELEMENTS, MPI_FLOAT, source, tag, MPI_COMM_WORLD, &stat);
   printf("rank= %d b= %3.1f %3.1f %3.1f %3.1f %3.1f %3.1f\n",
          rank, b[0], b[1], b[2], b[3], b[4], b[5]);

   MPI_Type_free(&indextype);
   MPI_Finalize();
}

https://computing.llnl.gov/tutorials/mpi/

Declares a 16-element array of type float:
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0 15.0 16.0

Creates a new datatype indextype that picks out two blocks of the array: 4 elements starting at displacement 5 (6.0 7.0 8.0 9.0) and 2 elements starting at displacement 12 (13.0 14.0).

Page 408: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example : Derived Datatypes - Indexed

39

https://computing.llnl.gov/tutorials/mpi/

Page 409: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201140

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 410: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201141

Derived Datatypes : struct

Function: MPI_Type_struct()

int MPI_Type_struct(

int count,

int *array_of_blocklengths,

MPI_Aint *array_of_displacements,

MPI_Datatype *array_of_types,

MPI_Datatype *newtype);

Description:

Returns a new datatype that represents count blocks. Each block is defined by an entry in array_of_blocklengths, array_of_displacements and array_of_types; displacements are expressed in bytes. count is an integer that specifies the number of blocks (and the number of entries in each of the three arrays). array_of_blocklengths gives the number of elements in each block, array_of_displacements gives the byte displacement of each block, and array_of_types gives the type of the elements that make up each block.

...

MPI_Type_struct(2, blockcounts, offsets, oldtypes, &particletype);...

https://computing.llnl.gov/tutorials/mpi/man/MPI_Type_struct.txt

Page 411: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example : Derived Datatype - struct

42

#include "mpi.h"
#include <stdio.h>
#define NELEM 25

int main(int argc, char *argv[])
{
   int numtasks, rank, source=0, dest, tag=1, i;
   typedef struct {
      float x, y, z;
      float velocity;
      int n, type;
   } Particle;
   Particle p[NELEM], particles[NELEM];
   MPI_Datatype particletype, oldtypes[2];
   int blockcounts[2];
   /* MPI_Aint type used to be consistent with syntax of */
   /* MPI_Type_extent routine */
   MPI_Aint offsets[2], extent;
   MPI_Status stat;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

   /* Setup description of the 4 MPI_FLOAT fields x, y, z, velocity */
   offsets[0] = 0;  oldtypes[0] = MPI_FLOAT;  blockcounts[0] = 4;

   /* Setup description of the 2 MPI_INT fields n, type */
   MPI_Type_extent(MPI_FLOAT, &extent);
   offsets[1] = 4 * extent;  oldtypes[1] = MPI_INT;  blockcounts[1] = 2;

   MPI_Type_struct(2, blockcounts, offsets, oldtypes, &particletype);
   MPI_Type_commit(&particletype);

   if (rank == 0) {
      for (i=0; i<NELEM; i++) {
         particles[i].x = i * 1.0;   particles[i].y = i * -1.0;
         particles[i].z = i * 1.0;   particles[i].velocity = 0.25;
         particles[i].n = i;         particles[i].type = i % 2;
      }
      for (i=0; i<numtasks; i++)
         MPI_Send(particles, NELEM, particletype, i, tag, MPI_COMM_WORLD);
   }

   MPI_Recv(p, NELEM, particletype, source, tag, MPI_COMM_WORLD, &stat);
   printf("rank= %d %3.2f %3.2f %3.2f %3.2f %d %d\n", rank,
          p[3].x, p[3].y, p[3].z, p[3].velocity, p[3].n, p[3].type);

   MPI_Type_free(&particletype);
   MPI_Finalize();
}

https://computing.llnl.gov/tutorials/mpi/

Declaring the structure of the heterogeneous datatype Float, Float, Float, Float, Int, Int

Construct the heterogeneous datatype as an MPI datatype using Struct

Populate the heterogeneous MPI datatype with heterogeneous data

Page 412: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201143

https://computing.llnl.gov/tutorials/mpi/

Example : Derived Datatype - struct

Page 413: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201144

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 414: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201145

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.

Matrix Vector Multiplication

Page 415: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201146

Matrix Vector Multiplication

where A is an n x m matrix and B is a vector of size m and C is a vector of size n.

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.

Multiplication of a matrix A and a vector B produces the vector C, whose elements c_i (0 <= i < n) are computed as follows:

\[
c_i = \sum_{k=0}^{m-1} A_{i,k}\, b_k
\]
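For reference, a minimal serial C sketch of this formula (the 4x4 size and the initial values mirror the MPI case study that follows; the function names are illustrative only):

#include <stdio.h>

#define N 4   /* rows of A, length of c */
#define M 4   /* columns of A, length of b */

/* c[i] = sum over k of A[i][k] * b[k] */
void matvec(double A[N][M], double b[M], double c[N])
{
    for (int i = 0; i < N; i++) {
        c[i] = 0.0;
        for (int k = 0; k < M; k++)
            c[i] += A[i][k] * b[k];
    }
}

int main(void)
{
    double A[N][M] = {{0,1,2,3},{1,2,3,4},{2,3,4,5},{3,4,5,6}};
    double b[M] = {1, 2, 3, 4};
    double c[N];
    matvec(A, b, c);
    for (int i = 0; i < N; i++)
        printf("c[%d] = %g\n", i, c[i]);
    return 0;
}

The MPI example later in this lecture parallelizes this loop nest by distributing rows of A among worker tasks.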

Page 416: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201147

Matrix-Vector Multiplication: c = A x b

Page 417: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example: Matrix-Vector Multiplication

(DEMO)

48

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define NRA 4          /* number of rows in matrix A */
#define NCA 4          /* number of columns in matrix A */
#define NCB 1          /* number of columns in matrix B */
#define MASTER 0       /* taskid of first task */
#define FROM_MASTER 1  /* setting a message type */
#define FROM_WORKER 2  /* setting a message type */

int main (int argc, char *argv[])
{
   int numtasks,              /* number of tasks in partition */
       taskid,                /* a task identifier */
       numworkers,            /* number of worker tasks */
       source, dest,          /* task ids of message source and destination */
       mtype,                 /* message type */
       rows,                  /* rows of matrix A sent to each worker */
       averow, extra, offset, /* used to determine rows sent to each worker */
       i, j, k, rc;           /* misc */

Define the dimensions of the Matrix a([4][4]) and Vector b([4][1])

Page 418: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201149

   double a[NRA][NCA],   /* Matrix A to be multiplied */
          b[NCA][NCB],   /* Vector B to be multiplied */
          c[NRA][NCB];   /* result Vector C */
   MPI_Status status;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
   if (numtasks < 2) {
      printf("Need at least two MPI tasks. Quitting...\n");
      MPI_Abort(MPI_COMM_WORLD, rc);
      exit(1);
   }
   numworkers = numtasks-1;

/**************************** master task ************************************/
   if (taskid == MASTER)
   {
      printf("mpi_mm has started with %d tasks.\n", numtasks);
      printf("Initializing arrays...\n");
      for (i=0; i<NRA; i++)
         for (j=0; j<NCA; j++)
            a[i][j] = i+j;
      for (i=0; i<NCA; i++)
         for (j=0; j<NCB; j++)
            b[i][j] = (i+1)*(j+1);

Example: Matrix-Vector Multiplication

Declare the matrix, the vector to be multiplied, and the resultant vector

MASTER initializes the Matrix A:
0.00 1.00 2.00 3.00
1.00 2.00 3.00 4.00
2.00 3.00 4.00 5.00
3.00 4.00 5.00 6.00

MASTER initializes B: 1.00 2.00 3.00 4.00

Page 419: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201150

      for (i=0; i<NRA; i++) {
         printf("\n");
         for (j=0; j<NCA; j++)
            printf("%6.2f ", a[i][j]);
      }
      for (i=0; i<NRA; i++) {
         printf("\n");
         for (j=0; j<NCB; j++)
            printf("%6.2f ", b[i][j]);
      }

      /* Send matrix data to the worker tasks */
      averow = NRA/numworkers;
      extra = NRA%numworkers;
      offset = 0;
      mtype = FROM_MASTER;
      for (dest=1; dest<=numworkers; dest++)
      {
         rows = (dest <= extra) ? averow+1 : averow;
         printf("Sending %d rows to task %d offset=%d\n", rows, dest, offset);
         MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
         MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
         MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
         MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
         offset = offset + rows;
      }

Example: Matrix-Vector Multiplication

Load Balancing : Dividing the Matrix A based on the number of processors

MASTER sends Matrix A to workers:
PROC[0] :: 0.00 1.00 2.00 3.00
PROC[1] :: 1.00 2.00 3.00 4.00
PROC[2] :: 2.00 3.00 4.00 5.00
PROC[3] :: 3.00 4.00 5.00 6.00

MASTER sends Vector B to workers:
PROC[0] :: 1.00 2.00 3.00 4.00
PROC[1] :: 1.00 2.00 3.00 4.00
PROC[2] :: 1.00 2.00 3.00 4.00
PROC[3] :: 1.00 2.00 3.00 4.00

Page 420: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example: Matrix-Vector Multiplication

51

      /* Receive results from worker tasks */
      mtype = FROM_WORKER;
      for (i=1; i<=numworkers; i++)
      {
         source = i;
         MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
         MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
         MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype,
                  MPI_COMM_WORLD, &status);
         printf("Received results from task %d\n", source);
      }

      /* Print results */
      printf("******************************************************\n");
      printf("Result Matrix:\n");
      for (i=0; i<NRA; i++)
      {
         printf("\n");
         for (j=0; j<NCB; j++)
            printf("%6.2f ", c[i][j]);
      }
      printf("\n******************************************************\n");
      printf("Done.\n");
   }

The Master process gathers the results and populates the result matrix in the correct order (easily done in this case because the offset received from each worker indicates the position in the result array)

Page 421: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example: Matrix-Vector Multiplication

52

/**************************** worker task ************************************/
   if (taskid > MASTER)
   {
      mtype = FROM_MASTER;
      MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

      for (k=0; k<NCB; k++)
         for (i=0; i<rows; i++)
         {
            c[i][k] = 0.0;
            for (j=0; j<NCA; j++)
               c[i][k] = c[i][k] + a[i][j] * b[j][k];
         }

      mtype = FROM_WORKER;
      MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
      MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
      MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
   }
   MPI_Finalize();
}

Worker processes receive workload:
Proc[1] A : 1.00 2.00 3.00 4.00
Proc[1] B : 1.00 2.00 3.00 4.00

Calculate result:
Proc[1] C : 1.00 + 4.00 + 9.00 + 16.00 = 30.00

Page 422: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Example: Matrix-Vector Multiplication

(Results)

53

Page 423: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201154

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 424: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201155

MPI Profiling : MPI_Wtime

Function: MPI_Wtime()

double MPI_Wtime()

Description:

Returns the time in seconds elapsed on the calling processor. The resolution of the time scale is determined by MPI_WTICK. When the MPI environment variable MPI_WTIME_IS_GLOBAL is defined and set to true, the value of MPI_Wtime is synchronized across all processes in MPI_COMM_WORLD.

double time0;
...
time0 = MPI_Wtime();
...
printf("Hello From Worker #%d %lf \n", rank, (MPI_Wtime() - time0));

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Wtime.html

Page 425: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Timing Example: MPI_Wtime

56

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
   int size, rank;
   double time0, time1;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);

   time0 = MPI_Wtime();
   if (rank == 0)
   {
      printf(" Hello From Proc0 Time = %lf \n", (MPI_Wtime() - time0));
   }
   else
   {
      printf("Hello From Worker #%d %lf \n", rank, (MPI_Wtime() - time0));
   }
   MPI_Finalize();
}

Page 426: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201157

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 427: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Additional Topics

• Additional topics not yet covered :

– Communication Topologies

– Profiling using Tau (to be covered with PAPI & Parallel

Algorithms)

– Profiling using PMPI (to be covered with PAPI & Parallel

Algorithms)

– Debugging MPI programs

58

Page 428: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201159

Topics

• MPI Collective Calls: Synchronization Primitives

• MPI Collective Calls: Communication Primitives

• MPI Collective Calls: Reduction Primitives

• Derived Datatypes: Introduction

• Derived Datatypes: Contiguous

• Derived Datatypes: Vector

• Derived Datatypes: Indexed

• Derived Datatypes: Struct

• Matrix-Vector multiplication : A Case Study

• MPI Profiling calls

• Additional Topics

• Summary Materials for Test

Page 429: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 2011

Summary : Material for the Test

• Collective calls

– Barrier (6) , Broadcast (9), Scatter(10), Gather(11), Allgather(12)

– Reduce(14), Binary operations (15), All Reduce (16)

• Derived Datatypes (25,26,27)

– Contiguous (29,30,31)

– Vector (33,34,35)

60

Page 430: Paralel Computing

CSC 7600 Lecture 8 : MPI2

Spring 201161

Page 431: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS

SMP NODES

Prof.  Thomas  Sterling  Department  of  Computer  Science  Louisiana  State  University  February  15,  2011  

Page 432: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

2  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 433: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

3  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 434: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

4  

Opening Remarks

•  This week is about supercomputer architecture –  Last time: end of cooperative computing –  Today: capability computing with modern microprocessor and

multicore SMP node

•  As we’ve seen, there is a diversity of HPC system types •  Most common systems are either SMPs or are

ensembles of SMP nodes •  “SMP” stands for: “Symmetric Multi-Processor” •  System performance is strongly influenced by SMP node

performance •  Understanding structure, functionality, and operation of

SMP nodes will allow effective programming

Page 435: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

5  

The take-away message

•  Primary structure and elements that make up an SMP node

•  Primary structure and elements that make up the modern multicore microprocessor component

•  The factors that determine microprocessor delivered performance

•  The factors that determine overall SMP sustained performance

•  Amdahl’s law and how to use it •  Calculating cpi •  Reference: J. Hennessy & D. Patterson, “Computer Architecture

A Quantitative Approach” 3rd Edition, Morgan Kaufmann, 2003

Page 436: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

6  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 437: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

7  

SMP Context

•  A standalone system –  Incorporates everything needed for

•  Processors •  Memory •  External I/O channels •  Local disk storage •  User interface

–  Enterprise server and institutional computing market •  Exploits economy of scale to enhance performance to cost •  Substantial performance

–  Target for ISVs (Independent Software Vendors) •  Shared memory multiple thread programming platform

–  Easier to program than distributed memory machines –  Enough parallelism to fully employ system threads (processor cores)

•  Building block for ensemble supercomputers –  Commodity clusters –  MPPs

Page 438: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

8  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 439: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

9  

Performance: Amdahl’s Law

Baton Rouge to Houston
•  from my house on East Lakeshore Dr.
•  to the downtown Hyatt Regency
•  distance of 271 miles
•  in-air flight time: 1 hour
•  door-to-door time to drive: 4.5 hours
•  cruise speed of Boeing 737: 600 mph
•  cruise speed of BMW 528: 60 mph

Page 440: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

10  

Amdahl’s Law: drive or fly? •  Peak performance gain: 10X

–  BMW cruise approx. 60 MPH –  Boeing 737 cruise approx. 600 MPH

•  Time door to door –  BMW

•  Google estimates 4 hours 30 minutes –  Boeing 737

•  Time to drive to BTR from my house = 15 minutes •  Wait time at BTR = 1 hour •  Taxi time at BTR = 5 minutes •  Continental estimates BTR to IAH 1 hour •  Taxi time at IAH = 15 minutes (assuming gate available) •  Time to get bags at IAH = 25 minutes •  Time to get rental car = 15 minutes •  Time to drive to Hyatt Regency from IAH = 45 minutes •  Total time = 4.0 hours

•  Sustained performance gain: 1.125X
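The two gains quoted above follow directly from the numbers on this slide:

\[
S_{peak} = \frac{600\ \text{mph}}{60\ \text{mph}} = 10\times,
\qquad
S_{sustained} = \frac{T_{drive}}{T_{fly}} = \frac{4.5\ \text{h}}{4.0\ \text{h}} = 1.125\times
\]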

Page 441: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

11  

Amdahl’s Law

Definitions:
T_O = time for the non-accelerated computation
T_A = time for the accelerated computation
T_F = time of the portion of the computation that can be accelerated
g   = peak performance gain for the accelerated portion
f   = fraction of the non-accelerated computation to be accelerated
S   = speedup of the computation with acceleration applied

\[
T_A = (1 - f)\,T_O + f\,\frac{T_O}{g}
\]
\[
S = \frac{T_O}{T_A} = \frac{1}{(1 - f) + \dfrac{f}{g}}
\]

(Timeline sketch: the original run of length T_O contains an accelerable portion T_F; in the accelerated run of length T_A that portion shrinks to T_F/g.)
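As a quick illustration with made-up values (not from the slides), take f = 0.9 and g = 10:

\[
S = \frac{1}{(1 - 0.9) + \dfrac{0.9}{10}} = \frac{1}{0.19} \approx 5.3
\]

Even a 10x accelerator applied to 90% of the work yields only about a 5x overall speedup.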

Page 442: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

12  

Amdahl’s Law and Parallel Computers

•  Amdahl's Law (FracX: fraction of the original work to be sped up)
Speedup = 1 / [(FracX/SpeedupX) + (1-FracX)]

•  A portion is sequential => limits parallel speedup
–  Speedup <= 1 / (1-FracX)

•  Ex. What fraction sequential to get 80X speedup from 100 processors? Assume either 1 processor or 100 fully used

80 = 1 / [(FracX/100) + (1-FracX)]
0.8*FracX + 80*(1-FracX) = 80 - 79.2*FracX = 1
FracX = (80-1)/79.2 = 0.9975
•  Only 0.25% sequential!
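A small C sketch (illustrative only) that evaluates the same formula and checks the result above:

#include <stdio.h>

/* Amdahl speedup: frac = parallelizable fraction, n = number of processors */
double amdahl(double frac, double n)
{
    return 1.0 / ((frac / n) + (1.0 - frac));
}

int main(void)
{
    printf("Speedup = %.1f\n", amdahl(0.9975, 100.0));  /* prints roughly 80 */
    return 0;
}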

Page 443: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

13  

Amdahl’s Law with Overhead

Definitions (other symbols as on the previous slide):
v = overhead of each accelerated work segment
V = total overhead for the accelerated work, V = n v (n = number of accelerated segments)
T_F = sum of the accelerable segment times t_F

\[
T_A = (1 - f)\,T_O + f\,\frac{T_O}{g} + n\,v
\]
\[
S = \frac{T_O}{T_A} = \frac{1}{(1 - f) + \dfrac{f}{g} + \dfrac{n\,v}{T_O}}
\]

(Timeline sketch: the original run T_O contains n accelerable segments of length t_F; in the accelerated run T_A each becomes v + t_F/g.)
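With the same made-up values as before (f = 0.9, g = 10) and a total overhead n v equal to 5% of T_O:

\[
S = \frac{1}{(1 - 0.9) + \dfrac{0.9}{10} + 0.05} = \frac{1}{0.24} \approx 4.2
\]

compared with roughly 5.3 when the overhead is ignored.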

Page 444: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

14  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 445: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

15  

SMP Node Diagram

(Diagram: an SMP node. Several microprocessors (MP), each with private L1 and L2 caches, share L3 caches and connect through a controller to memory banks M1, M2, ..., Mn; the node also provides storage (S), NICs, USB peripherals, JTAG, Ethernet, and PCI-e interfaces.)

Legend:
MP : MicroProcessor
L1, L2, L3 : Caches
M1, M2, ... : Memory Banks
S : Storage
NIC : Network Interface Card

Page 446: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

16  

SMP System Examples

Vendor & name | Processor | Number of cores | Cores per proc. | Memory | Chipset | PCI slots
IBM eServer p5 595 | IBM Power5 1.9 GHz | 64 | 2 | 2 TB | Proprietary GX+, RIO-2 | <=240 PCI-X (20 standard)
Microway QuadPuter-8 | AMD Opteron 2.6 GHz | 16 | 2 | 128 GB | Nvidia nForce Pro 2200+2050 | 6 PCIe
Ion M40 | Intel Itanium 2 1.6 GHz | 8 | 2 | 128 GB | Hitachi CF-3e | 4 PCIe, 2 PCI-X
Intel Server System SR870BN4 | Intel Itanium 2 1.6 GHz | 8 | 2 | 64 GB | Intel E8870 | 8 PCI-X
HP Proliant ML570 G3 | Intel Xeon 7040 3 GHz | 8 | 2 | 64 GB | Intel 8500 | 4 PCIe, 6 PCI-X
Dell PowerEdge 2950 | Intel Xeon 5300 2.66 GHz | 8 | 4 | 32 GB | Intel 5000X | 3 PCIe

Page 447: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

17  

Sample SMP Systems

DELL  PowerEdge  

HP  Proliant  

Intel    Server  System  

IBM  p5  595  

Microway  Quadputer  

Page 448: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

18  

HyperTransport-based SMP System

Source: http://www.devx.com/amd/Article/17437

Page 449: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

19  

Comparison of Opteron and Xeon SMP Systems

Source: http://www.devx.com/amd/Article/17437

Page 450: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

20  

Multi-Chip Module (MCM) Component of IBM Power5 Node

20  

Page 451: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

21  

Major Elements of an SMP Node •  Processor chip •  DRAM main memory cards •  Motherboard chip set •  On-board memory network

–  North bridge •  On-board I/O network

–  South bridge •  PCI industry standard interfaces

–  PCI, PCI-X, PCI-express •  System Area Network controllers

–  e.g. Ethernet, Myrinet, Infiniband, Quadrics, Federation Switch •  System Management network

–  Usually Ethernet –  JTAG for low level maintenance

•  Internal disk and disk controller •  Peripheral interfaces

Page 452: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

22  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 453: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

23  

(Die photo labels: FPU, IA-32 Control, Instr. Fetch & Decode, Cache, TLB, Integer Units, IA-64 Control, Bus, Core Processor Die, 4 x 1MB L3 cache)

Itanium™ Processor Silicon (Copyright: Intel at Hotchips ’00)

Page 454: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

24  

Multicore Microprocessor Component Elements

•  Multiple processor cores –  One or more processors

•  L1 caches –  Instruction cache –  Data cache

•  L2 cache –  Joint instruction/data cache –  Dedicated to individual core processor

•  L3 cache –  Not all systems –  Shared among multiple cores –  Often off die but in same package

•  Memory interface –  Address translation and management (sometimes) –  North bridge

•  I/O interface –  South bridge

Page 455: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

25  

Comparison of Current Microprocessors

Processor | Clock rate | Caches (per core) | ILP (each core) | Cores per chip | Process & die size | Power | Linpack TPP (one core)
AMD Opteron | 2.6 GHz | L1I: 64KB, L1D: 64KB, L2: 1MB | 2 FPops/cycle, 3 Iops/cycle, 2* LS/cycle | 2 | 90nm, 220mm2 | 95W | 3.89 Gflops
IBM Power5+ | 2.2 GHz | L1I: 64KB, L1D: 32KB, L2: 1.875MB, L3: 18MB | 4 FPops/cycle, 2 Iops/cycle, 2 LS/cycle | 2 | 90nm, 243mm2 | 180W (est.) | 8.33 Gflops
Intel Itanium 2 (9000 series) | 1.6 GHz | L1I: 16KB, L1D: 16KB, L2I: 1MB, L2D: 256KB, L3: 3MB or more | 4 FPops/cycle, 4 Iops/cycle, 2 LS/cycle | 2 | 90nm, 596mm2 | 104W | 5.95 Gflops
Intel Xeon Woodcrest | 3 GHz | L1I: 32KB, L1D: 32KB, L2: 2MB | 4 FPops/cycle, 3 Iops/cycle, 1L+1S/cycle | 2 | 65nm, 144mm2 | 80W | 6.54 Gflops

Page 456: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

26  

Processor Core Micro Architecture

•  Execution Pipeline –  Stages of functionality to process issued instructions –  Hazards are conflicts with continued execution –  Forwarding supports closely associated operations exhibiting

precedence constraints •  Out of Order Execution

–  Uses reservation stations –  Hides some core latencies and provides fine-grain asynchronous
operation supporting concurrency •  Branch Prediction

–  Permits computation to proceed at a conditional branch point prior to resolving predicate value

–  Overlaps follow-on computation with predicate resolution –  Requires roll-back or equivalent to correct false guesses –  Sometimes follows both paths, and several deep

Page 457: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

27  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 458: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

28  

Recap: Who Cares About the Memory Hierarchy?

(Chart: Processor-DRAM memory gap (latency), 1980-2000. CPU performance improves about 60%/yr (2X/1.5 yr, "Moore's Law") while DRAM improves about 9%/yr (2X/10 yrs); the processor-memory performance gap grows roughly 50% per year. Axes: relative performance (1 to 1000, log scale) vs. time.)

Copyright 2001, UCB, David Patterson

Page 459: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011   29  

What is a cache? •  Small, fast storage used to improve average access time to slow

memory. •  Exploits spatial and temporal locality •  In computer architecture, almost everything is a cache!

–  Registers: a cache on variables –  First-level cache: a cache on second-level cache –  Second-level cache: a cache on memory –  Memory: a cache on disk (virtual memory) –  TLB :a cache on page table –  Branch-prediction: a cache on prediction information

(Hierarchy diagram, faster toward the top and bigger toward the bottom: Proc/Regs -> L1-Cache -> L2-Cache -> Memory -> Disk, Tape, etc.)

Copyright 2001, UCB, David Patterson

Page 460: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

30  

Levels of the Memory Hierarchy

Level | Capacity | Access time | Cost | Staging Xfer unit | Managed by
CPU Registers | 100s Bytes | < 0.5 ns (typically 1 CPU cycle) | - | Instr. operands, 1-8 bytes | prog./compiler
Cache (L1) | 10s-100s KBytes | 1-5 ns | $10/MByte | Blocks, 8-128 bytes | cache cntl
Main Memory | Few GBytes | 50-150 ns | $0.02/MByte | Pages, 512-4K bytes | OS
Disk | 100s-1000s GBytes | 500,000-1,500,000 ns | $0.25/GByte | Files, MBytes | user/operator
Tape | infinite | sec-min | $0.0014/MByte | - | -

(Upper levels are faster; lower levels are larger.)

Copyright 2001, UCB, David Patterson

Page 461: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

31  

Cache Measures

•  Hit rate: fraction found in that level –  So high that usually talk about Miss rate

•  Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)

•  Miss penalty: time to replace a block from lower level, including time to replace in CPU

–  access time: time to lower level = f(latency to lower level)

–  transfer time: time to transfer block =f(BW between upper & lower levels)

Copyright 2001, UCB, David Patterson
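A quick illustration of the average access time formula with hypothetical numbers (1 ns hit time, 5% miss rate, 100 ns miss penalty; not from the slides):

\[
\text{Average memory-access time} = 1\ \text{ns} + 0.05 \times 100\ \text{ns} = 6\ \text{ns}
\]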

Page 462: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

32  

Memory Hierarchy: Terminology •  Hit: data appears in some block in the upper level (example:

Block X) –  Hit Rate: the fraction of memory accesses found in the upper level –  Hit Time: Time to access the upper level which consists of

RAM access time + Time to determine hit/miss •  Miss: data needs to be retrieved from a block in the lower level

(Block Y) –  Miss Rate = 1 - (Hit Rate) –  Miss Penalty: Time to replace a block in the upper level +

Time to deliver the block to the processor •  Hit Time << Miss Penalty (500 instructions on 21264!)

(Diagram: a block Blk X held in upper-level memory is delivered to the processor on a hit; a block Blk Y must be fetched from lower-level memory into the upper level on a miss.)

Copyright 2001, UCB, David Patterson

Page 463: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

Cache Performance

33  

\[
T = I_{count} \times CPI \times T_{cycle}
\]
\[
CPI = \left(\frac{I_{ALU}}{I_{count}}\right) CPI_{ALU} + \left(\frac{I_{MEM}}{I_{count}}\right) CPI_{MEM}
\]

T = total execution time
T_cycle = time for a single processor cycle
I_count = total number of instructions
I_ALU = number of ALU instructions (e.g. register-register)
I_MEM = number of memory access instructions (e.g. load, store)
CPI = average cycles per instruction
CPI_ALU = average cycles per ALU instruction
CPI_MEM = average cycles per memory instruction
r_miss = cache miss rate
r_hit = cache hit rate
CPI_MEM-MISS = cycles per cache miss
CPI_MEM-HIT = cycles per cache hit
M_ALU = instruction mix for ALU instructions
M_MEM = instruction mix for memory access instructions

Page 464: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

Cache Performance

34  

Instruction mix:
\[
M_{ALU} = \frac{I_{ALU}}{I_{count}}, \qquad M_{MEM} = \frac{I_{MEM}}{I_{count}}, \qquad M_{ALU} + M_{MEM} = 1
\]
\[
CPI = M_{ALU}\,CPI_{ALU} + M_{MEM}\,CPI_{MEM}
\]
\[
T = I_{count} \times \left( M_{ALU}\,CPI_{ALU} + M_{MEM}\,CPI_{MEM} \right) \times T_{cycle}
\]

(Variable definitions as on the previous slide.)

Page 465: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

Cache Performance

35  

(Variable definitions as on slide 33.)

\[
CPI_{MEM} = CPI_{MEM\text{-}HIT} + r_{miss}\,CPI_{MEM\text{-}MISS}
\]
\[
T = I_{count} \times \left[ M_{ALU}\,CPI_{ALU} + M_{MEM}\left( CPI_{MEM\text{-}HIT} + r_{miss}\,CPI_{MEM\text{-}MISS} \right) \right] \times T_{cycle}
\]

Page 466: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

Cache Performance: Example

36  

Given: $I_{count} = 10^{11}$, $I_{MEM} = 2 \times 10^{10}$, $CPI_{ALU} = 1$, $T_{cycle} = 0.5$ ns, $CPI_{MEM\text{-}HIT} = 1$, $CPI_{MEM\text{-}MISS} = 100$.

Instruction mix: $I_{ALU} = I_{count} - I_{MEM} = 8 \times 10^{10}$, so $M_{ALU} = 0.8$ and $M_{MEM} = 0.2$.

Case A, $r_{hit\text{-}A} = 0.9$:
\[
CPI_{MEM\text{-}A} = 1 + (1 - 0.9) \times 100 = 11
\]
\[
T_A = 10^{11} \times \left( (0.8 \times 1) + (0.2 \times 11) \right) \times 0.5 \times 10^{-9}\ \text{s} = 150\ \text{sec}
\]

Case B, $r_{hit\text{-}B} = 0.5$:
\[
CPI_{MEM\text{-}B} = 1 + (1 - 0.5) \times 100 = 51
\]
\[
T_B = 10^{11} \times \left( (0.8 \times 1) + (0.2 \times 51) \right) \times 0.5 \times 10^{-9}\ \text{s} = 550\ \text{sec}
\]
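A minimal C sketch that evaluates the execution-time model above with the values of this example (the printed times should match the 150 s and 550 s results):

#include <stdio.h>

int main(void)
{
    double Icount  = 1e11;      /* total instructions */
    double Imem    = 2e10;      /* memory access instructions */
    double Tcycle  = 0.5e-9;    /* seconds per cycle */
    double cpi_alu = 1.0, cpi_hit = 1.0, cpi_miss = 100.0;
    double m_mem   = Imem / Icount;   /* 0.2 */
    double m_alu   = 1.0 - m_mem;     /* 0.8 */
    double rhit[2] = { 0.9, 0.5 };

    for (int i = 0; i < 2; i++) {
        double cpi_mem = cpi_hit + (1.0 - rhit[i]) * cpi_miss;
        double T = Icount * (m_alu * cpi_alu + m_mem * cpi_mem) * Tcycle;
        printf("hit rate %.1f: CPI_MEM = %.0f, T = %.0f s\n", rhit[i], cpi_mem, T);
    }
    return 0;
}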

Page 467: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

37  

Page 468: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

38  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 469: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

39  

Motherboard Chipset

•  Provides core functionality of motherboard •  Embeds low-level protocols to facilitate efficient communication between

local components of computer system •  Controls the flow of data between the CPU, system memory, on-board

peripheral devices, expansion interfaces and I/O subsystem •  Also responsible for power management features, retention of non-volatile

configuration data and real-time measurement •  Typically consists of:

–  Northbridge (Memory Controller Hub, MCH), managing traffic between the processor, RAM, GPU, southbridge and optionally PCI Express slots

–  Southbridge (I/O Controller Hub, ICH), coordinating slower set of devices, including traditional PCI bus, ISA bus, SMBus, IDE (ATA), DMA and interrupt controllers, real-time clock, BIOS memory, ACPI power management, LPC bridge (providing fan control, floppy disk, keyboard, mouse, MIDI interfaces, etc.), and optionally Ethernet, USB, IEEE1394, audio codecs and RAID interface

Page 470: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

40  

Major Chipset Vendors

•  Intel –  http://developer.intel.com/products/chipsets/index.htm

•  Via –  http://www.via.com.tw/en/products/chipsets

•  SiS –  http://www.sis.com/products/product_000001.htm

•  AMD/ATI –  http://ati.amd.com/products/integrated.html

•  Nvidia –  http://www.nvidia.com/page/mobo.html

Page 471: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

41  

Chipset Features Overview

Page 472: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

42  

Motherboard

•  Also referred to as main board, system board, backplane •  Provides mechanical and electrical support for pluggable

components of a computer system •  Constitutes the central circuitry of a computer,

distributing power and clock signals to target devices, and implementing communication backplane for data exchanges between them

•  Defines expansion possibilities of a computer system through slots accommodating special purpose cards, memory modules, processor(s) and I/O ports

•  Available in many form factors and with various capabilities to match particular system needs, housing capacity and cost

Page 473: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

43  

Motherboard Form Factors

•  Refer to standardized motherboard sizes •  Most popular form factor used today is ATX, evolved

from now obsolete AT (Advanced Technology) format •  Examples of other common form factors:

–  MicroATX, miniaturized version of ATX –  WTX, large form factor designated for use in high power

workstations/servers featuring multiple processors –  Mini-ITX, designed for use in thin clients –  PC/104 and ETX, used in embedded systems and single

board computers –  BTX (Balanced Technology Extended), introduced by Intel as

a possible successor to ATX

Page 474: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

44  

Motherboard Manufacturers

•  Abit  •  Albatron  •  Aopen  •  ASUS  •  Biostar  •  DFI  •  ECS  •  Epox  •  FIC  •  Foxconn  •  Gigabyte  

•  IBM  •  Intel  •  Jetway  •  MSI  •  ShuTle  •  Soyo  •  SuperMicro  •  Tyan  •  VIA  

Page 475: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

45  

Source: http://www.motherboards.org

Populated CPU Socket

Page 476: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

46  

Source: http://www.motherboards.org

DIMM Memory Sockets

Page 477: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

47  

Motherboard on Arete

Page 478: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

48  

Source: http://www.tyan.com

SuperMike Motherboard: Tyan Thunder i7500 (S720)

Page 479: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

49  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 480: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

50  

PCI enhanced systems

http://arstechnica.com/articles/paedia/hardware/pcie.ars/1

Page 481: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

51  

PCI-express

Lane width | Clock speed | Throughput (duplex, bits) | Throughput (duplex, bytes) | Initial expected uses
x1 | 2.5 GHz | 5 Gbps | 400 MBps | Slots, Gigabit Ethernet
x2 | 2.5 GHz | 10 Gbps | 800 MBps |
x4 | 2.5 GHz | 20 Gbps | 1.6 GBps | Slots, 10 Gigabit Ethernet, SCSI, SAS
x8 | 2.5 GHz | 40 Gbps | 3.2 GBps |
x16 | 2.5 GHz | 80 Gbps | 6.4 GBps | Graphics adapters

http://www.redbooks.ibm.com/abstracts/tips0456.html

Page 482: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

52  

PCI-X | Bus width | Clock speed | Features | Bandwidth
PCI-X 66 | 64 bits | 66 MHz | Hot Plugging, 3.3 V | 533 MB/s
PCI-X 133 | 64 bits | 133 MHz | Hot Plugging, 3.3 V | 1.06 GB/s
PCI-X 266 | 64 bits, optional 16 bits only | 133 MHz Double Data Rate | Hot Plugging, 3.3 & 1.5 V, ECC supported | 2.13 GB/s
PCI-X 533 | 64 bits, optional 16 bits only | 133 MHz Quad Data Rate | Hot Plugging, 3.3 & 1.5 V, ECC supported | 4.26 GB/s

Page 483: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

53  

Bandwidth Comparisons CONNECTION BITS BYTES

PCI 32-bit/33 MHz 1.06666 Gbit/s 133.33 MB/s

PCI 64-bit/33 MHz 2.13333 Gbit/s 266.66 MB/s

PCI 32-bit/66 MHz 2.13333 Gbit/s 266.66 MB/s

PCI 64-bit/66 MHz 4.26666 Gbit/s 533.33 MB/s

PCI 64-bit/100 MHz 6.39999 Gbit/s 799.99 MB/s

PCI Express (x1 link)[6] 2.5 Gbit/s 250 MB/s

PCI Express (x4 link)[6] 10 Gbit/s 1 GB/s

PCI Express (x8 link)[6] 20 Gbit/s 2 GB/s
PCI Express (x16 link)[6] 40 Gbit/s 4 GB/s

PCI Express 2.0 (x32 link)[6] 80 Gbit/s 8 GB/s

PCI-X DDR 16-bit 4.26666 Gbit/s 533.33 MB/s

PCI-X 133 8.53333 Gbit/s 1.06666 GB/s

PCI-X QDR 16-bit 8.53333 Gbit/s 1.06666 GB/s

PCI-X DDR 17.066 Gbit/s 2.133 GB/s

PCI-X QDR 34.133 Gbit/s 4.266 GB/s

AGP 8x 17.066 Gbit/s 2.133 GB/s

Page 484: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

54  

HyperTransport : Context

•  Northbridge-Southbridge device connection facilitates communication over fast processor bus between system memory, graphics adaptor, CPU

•  The Southbridge operates several I/O interfaces and communicates with the Northbridge over another proprietary connection

•  This approach is potentially limited by the emerging bandwidth demands over inadequate I/O buses

•  HyperTransport is one of the many technologies aimed at improving I/O.

•  High data rates are achieved by using enhanced, low-swing, 1.2 V Low Voltage Differential Signaling (LVDS) that employs fewer pins and wires consequently reducing cost and power requirements.

•  HyperTransport also helps in communication between multiple AMD Opteron CPUs

http://www.amd.com/us-en/Processors/ComputingSolutions/0,,30_288_13265_13295%5E13340,00.html

Page 485: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

55  

Hyper-Transport (continued) •  Point-to-point parallel topology uses 2

unidirectional links (one each for upstream and downstream)

•  HyperTransport technology chunks data into packets to reduce overhead and improve efficiency of transfers.

•  Each HyperTransport technology link also contains 8-bit data path that allows for insertion of a control packet in the middle of a long data packet, thus reducing latency.

•  In Summary : “HyperTransport™ technology delivers the raw throughput and low latency necessary for chip-to-chip communication. It increases I/O bandwidth, cuts down the number of different system buses, reduces power consumption, provides a flexible, modular bridge architecture, and ensures compatibility with PCI. “

http://www.amd.com/us-en/Processors/ComputingSolutions /0,,30_288_13265_13295%5E13340,00.html

Page 486: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

56  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 487: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

57  

Performance Issues

•  Cache behavior –  Hit/miss rate –  Replacement strategies

•  Prefetching •  Clock rate •  ILP •  Branch prediction •  Memory

–  Access time –  Bandwidth

Page 488: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

58  

Topics

•  Introduction •  SMP Context •  Performance: Amdahl’s Law •  SMP System structure •  Processor core •  Memory System •  Chip set •  South Bridge – I/O •  Performance Issues •  Summary – Material for the Test

Page 489: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

59  

Summary – Material for the Test

•  Please make sure that you have addressed all points outlined on slide 5

•  Understand content on slide 7 •  Understand concepts, equations, problems on

slides 11, 12, 13 •  Understand content on 21, 24, 26, 29 •  Understand concepts on slides 32,33,34,35,36 •  Understand content on slides 39, 57

•  Required reading material :

http://arstechnica.com/articles/paedia/hardware/pcie.ars/1

Page 490: Paralel Computing

CSC  7600  Lecture  9  :  SMP  Nodes      Spring  2011  

60  

Page 491: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS

Pthreads

Prof. Thomas Sterling Department of Computer Science Louisiana State University February 22, 2011

Page 492: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

2

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 493: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

3

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 494: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Opening Remarks

•  We now have a good picture of supercomputer architecture –  including SMP structures

•  which are the building blocks of most HPC systems on the Top-500 List

•  We were introduced to the first two programming methods for exploiting parallelism –  Capacity Computing - Condor –  Co-operative Computing - MPI

•  Now we explore a 3rd programming model: multithreaded computing on shared memory systems –  This time: general principles and POSIX Pthreads –  Next time: OpenMP

4

Page 495: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

What you’ll Need to Know

•  Modeling time to execution with CPI •  Multi-thread programming and execution concepts

–  Parallelism with multiple threads –  Synchronization –  Memory consistency models

•  Basic Pthread commands •  Dangers

–  Race conditions –  Deadlock

5

Page 496: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

6

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 497: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

7

CPI

rate miss cache nsinstructiomemory for timeexecution

nsinstructioregister for timeexecution timeexecution

timecycle nsinstructiomemory executed ofnumber

nsinstructioregister executed ofnumber nsinstructio executed ofnumber

penalty) (miss miss cache with operationsmemory for cpi hit cache with operationsmemory for cpi

operationsmemory for cpi operationsregister for cpi

n instructioper cycles

miss

M

R

c

M

R

Mmiss

Mhit

M

R

rTTTt

I#I#I#

cpicpicpicpicpi

Page 498: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

8

CPI (continued)

\[
T = \#I \times cpi \times t_c
\]
\[
m_R = \frac{\#I_R}{\#I}, \qquad m_M = \frac{\#I_M}{\#I}, \qquad m_R + m_M = 1.0
\]
\[
cpi = m_R\,cpi_R + m_M\,cpi_M
\]
\[
cpi_M = (1 - r_{miss})\,cpi_{M\text{-}hit} + r_{miss}\,cpi_{M\text{-}miss}
\]
\[
T = \#I \times \left[ m_R\,cpi_R + m_M\left( (1 - r_{miss})\,cpi_{M\text{-}hit} + r_{miss}\,cpi_{M\text{-}miss} \right) \right] \times t_c
\]

Page 499: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

An Example

Robert hates parallel computing and runs all of his jobs on a single processor core on his Acme computer. His current application plays solitaire because he is too lazy to flip the cards himself. The machine he is running on has a 2 GHz clock. For this problem the basic register operations make up only 75% of the instruction mix but deliver one and a half instructions per cycle, while the load and store operations yield one per cycle. But his cache hit rate is only 80%, and the average penalty for not finding data in the L1 cache is 120 nanoseconds. A counter on the Acme processor tells Robert that it takes approximately 16 billion instruction executions to run his short program. How long does it take to execute Robert's application?

9

Page 500: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

And the answer is …

\[
\#I = 16{,}000{,}000{,}000 = 1.6 \times 10^{10}
\]
\[
\text{clock rate} = 2.0\ \text{GHz} \Rightarrow t_c = 0.5\ \text{ns}
\]
\[
r_{hit} = 0.8 \Rightarrow r_{miss} = 1 - r_{hit} = 0.2
\]
\[
cpi_R = 2/3 \ (\text{1.5 instructions per cycle}), \qquad cpi_{M\text{-}hit} = 1
\]
\[
cpi_{M\text{-}miss} = 2\ \text{cycles/ns} \times 120\ \text{ns} = 240\ \text{cycles}
\]
\[
m_R = 0.75, \qquad m_M = 0.25
\]
\[
T = 1.6 \times 10^{10} \times \left[ 0.75 \times \tfrac{2}{3} + 0.25 \times (0.8 \times 1 + 0.2 \times 240) \right] \times 0.5 \times 10^{-9}\ \text{s}
\]
\[
  = 1.6 \times 10^{10} \times (0.5 + 12.2) \times 0.5 \times 10^{-9}\ \text{s}
  = 1.6 \times 10^{10} \times 12.7 \times 0.5 \times 10^{-9}\ \text{s}
  = 101.6\ \text{seconds}
\]

10

Page 501: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

11

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 502: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

UNIX Processes vs. Multithreaded Programs

12

(Diagram comparing three cases:
- Standard UNIX process (single-threaded): one address space holding text, global data, a stack, execution state, and a PID.
- New process spawned via fork(): PID1's address space is copied in full for the child PID2, each with its own text, global data, stack, and execution state.
- Multithreaded application: a single address space (one PID, one text segment, shared data); each thread created via thread create has its own stack, execution state, and private data.)

Page 503: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

13

Anatomy of a Thread

A thread (or, more precisely: thread of execution) is typically described as a lightweight process. There are, however, significant differences in the way standard processes and threads are created, how they interact, and how they access resources. Many aspects of these are implementation dependent.

The private state of a thread includes:
•  Execution state (instruction pointer, registers)
•  Stack
•  Private variables (typically allocated on the thread's stack)

Threads share access to global data in the application's address space.

Page 504: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

14

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 505: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

15

Race Conditions

Example: consider the following piece of pseudo-code to be executed concurrently by threads T1 and T2 (the initial value of memory location A is x)

A→R: read memory location A into register R R++: increment register R A←R: write R into memory location A

Scenario  1:    Step  1)  T1:(A→R)  →  T1:R=x  Step  2)  T1:(R++)  →  T1:R=x+1  Step  3)  T1:(A←R)  →  T1:A=x+1  Step  4)  T2:(A→R)  →  T2:R=x+1  Step  5)  T2:(R++)  →  T2:R=x+2  Step  6)  T2:(A←R)  →  T2:A=x+2  

Scenario  2:    Step  1)  T1:(A→R)  →  T1:R=x  Step  2)  T2:(A→R)  →  T2:R=x  Step  3)  T1:(R++)  →  T1:R=x+1  Step  4)  T2:(R++)  →  T2:R=x+1  Step  5)  T1:(A←R)  →  T1:A=x+1  Step  6)  T2:(A←R)  →  T2:A=x+1  

Since threads are scheduled arbitrarily by an external entity, the lack of explicit synchronization may cause different outcomes.

Race condition (or race hazard) is a flaw in system or process whereby the output of the system or process is unexpectedly and critically dependent on the sequence or timing of other events.

Suggested reading: http://en.wikipedia.org/wiki/Race_condition
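A minimal Pthreads sketch (not from the slides) that exhibits exactly this hazard: two threads perform the unsynchronized read-modify-write above on a shared counter, so the final value is usually less than the expected 2000000.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;             /* shared memory location A */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                   /* A->R, R++, A<-R without synchronization */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}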

Page 506: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Critical Sections

16

Critical section is a segment of code accessing a shared resource (data structure or device) that must not be concurrently accessed by more than one thread of execution.

Suggested reading: http://en.wikipedia.org/wiki/Critical_section

critical section

The implementation of a critical section must prevent any change of processor control once execution enters the critical section.

•  Code on uniprocessor systems may rely on disabling interrupts and avoiding system calls leading to context switches, restoring the interrupt mask to the previous state upon exit from the critical section

•  General solutions rely on synchronization mechanisms (hardware-assisted when possible), discussed on the next slides

Page 507: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Thread Synchronization Mechanisms

•  Based on atomic memory operation (require hardware support) –  Spinlocks –  Mutexes (and condition variables) –  Semaphores –  Derived constructs: monitors, rendezvous, mailboxes, etc.

•  Shared memory based locking –  Dekker’s algorithm

http://en.wikipedia.org/wiki/Dekker%27s_algorithm

–  Peterson’s algorithm http://en.wikipedia.org/wiki/Peterson%27s_algorithm

–  Lamport’s algorithm http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm http://research.microsoft.com/users/lamport/pubs/bakery.pdf

17

Page 508: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Spinlocks

•  Spinlock is the simplest kind of lock, where a thread waiting for the lock to become available repeatedly checks lock’s status

•  Since the thread remains active, but doesn’t perform a useful computation, such a lock is essentially busy-waiting, and hence generally wasteful

•  Spinlocks are desirable in some scenarios: –  If the waiting time is short, spinlocks save the overhead and cost of context

switches, required if other threads have to be scheduled instead –  In real-time system applications, spinlocks offer good and predictable

response time

•  Typically use fair scheduling of threads to work correctly •  Spinlock implementations require atomic hardware primitives,

such as test-and-set, fetch-and-add, compare-and-swap, etc.

18

Suggested reading: http://en.wikipedia.org/wiki/Spinlock
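A minimal spinlock sketch, assuming the GCC/Clang __sync atomic builtins as the test-and-set primitive (production code would more likely use pthread_spin_lock or C11 atomics):

typedef volatile int spinlock_t;     /* 0 = free, 1 = held */

static void spin_lock(spinlock_t *lock)
{
    /* test-and-set: atomically write 1 and return the previous value;
       keep trying while the previous value was already 1 (lock held) */
    while (__sync_lock_test_and_set(lock, 1))
        ;                            /* busy-wait */
}

static void spin_unlock(spinlock_t *lock)
{
    __sync_lock_release(lock);       /* atomically clear the flag */
}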

Page 509: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Mutexes

•  Mutex (abbreviation for mutual exclusion) is an algorithm used to prevent concurrent accesses to a common resource. The name also applies to the program object which negotiates access to that resource.

•  Mutex works by atomically setting an internal flag when a thread (mutex owner) enters a critical section of the code. As long as the flag is set, no other threads are permitted to enter the section. When the mutex owner completes operations within the critical section, the flag is (atomically) cleared.

19

Suggested reading: http://en.wikipedia.org/wiki/Mutex

lock(mutex) critical section unlock(mutex)
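The same lock / critical section / unlock pattern expressed with the Pthreads mutex API (a minimal sketch; the shared counter is only an illustrative resource):

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;

void *safe_increment(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);       /* enter the critical section */
    shared_counter++;                /* protected update */
    pthread_mutex_unlock(&lock);     /* leave the critical section */
    return NULL;
}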

Page 510: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Condition Variables •  Condition variables are frequently used in association with mutexes to increase

the efficiency of execution in multithreaded environments •  Typical use involves a thread or threads waiting for a certain condition (based on

the values of variables inside the critical section) to occur. Note that: –  The thread cannot wait inside the critical section, since no other thread would be

permitted to enter and modify the variables –  The thread could monitor the values by repeatedly accessing the critical section

through its mutex; such a solution is typically very wasteful •  Condition variable permits the waiting thread to temporarily release the mutex it

owns, and provide the means for other threads to communicate the state change within the critical section to the waiting thread (if such a change occurred)

20

/* waiting thread code: */ lock(mutex); /* check if you can progress */ while (condition not true) wait(cond_var); /* now you can; do your work */ ... unlock(mutex);

/* modifying thread code: */ lock(mutex); /* update critical section variables */ ... /* announce state change */ signal(cond_var); unlock(mutex);
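In Pthreads the waiting/modifying pattern above looks roughly like this (a sketch; the flag ready stands in for whatever condition the waiter needs):

#include <pthread.h>

static pthread_mutex_t m    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int ready = 0;                 /* state guarded by the mutex */

void waiter(void)
{
    pthread_mutex_lock(&m);
    while (!ready)                    /* re-check the condition after every wakeup */
        pthread_cond_wait(&cond, &m); /* atomically releases m while waiting */
    /* condition holds: do the work */
    pthread_mutex_unlock(&m);
}

void modifier(void)
{
    pthread_mutex_lock(&m);
    ready = 1;                        /* update critical-section state */
    pthread_cond_signal(&cond);       /* announce the state change */
    pthread_mutex_unlock(&m);
}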

Page 511: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Semaphores •  Semaphore is a protected variable introduced by Edsger Dijkstra (in the “THE”

operating system) and constitutes the classic method for restricting access to a shared resource

•  It is associated with an integer variable (semaphore’s value) and a queue of waiting threads

•  Semaphore can be accessed only via the atomic P and V primitives:

•  Usage: –  Semaphore’s value S.v is initialized to a positive number –  Semaphore’s queue S.q is initially empty –  Entrance to critical section is guarded by P(S) –  When exiting critical section, V(S) is invoked –  Note: mutex can be implemented as a binary semaphore

21

P(semaphore S) {
  if S.v > 0 then
    S.v := S.v - 1;
  else {
    insert current thread in S.q;
    change its state to blocked;
    schedule another thread;
  }
}

V(semaphore S) {
  if S.v = 0 and not empty(S.q) then {
    pick a thread T from S.q;
    change T’s state to ready;
  } else
    S.v := S.v + 1;
}
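For comparison (not part of the slides), the POSIX semaphore API in <semaphore.h> exposes these primitives directly: sem_wait() plays the role of P and sem_post() the role of V.

#include <semaphore.h>

sem_t S;
...
sem_init(&S, 0, 1);    /* S.v initialized to 1 (binary semaphore) */
...
sem_wait(&S);          /* P(S): decrement the value, or block while it is 0 */
/* critical section */
sem_post(&S);          /* V(S): increment the value, waking a blocked thread if any */
...
sem_destroy(&S);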

Suggested reading: http://www.mcs.drexel.edu/~shartley/OSusingSR/semaphores.html http://en.wikipedia.org/wiki/Semaphore_(programming)

Page 512: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Disadvantages of Locks

•  Blocking mechanism (forces threads to wait)
•  Conservative (lock has to be acquired when there’s only a possibility of access conflict)
•  Vulnerable to faults and failures (what if the owner of the lock dies?)
•  Programming is difficult and error prone (deadlocks, starvation)
•  Does not scale with problem size and complexity
•  Require balancing the granularity of locked data against the cost of fine-grain locks
•  Not composable
•  Suffer from priority inversion and convoying
•  Difficult to debug

22

Reference: http://en.wikipedia.org/wiki/Lock_(computer_science)

Page 513: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

23

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 514: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

24

Shared Memory Consistency Model

•  Defines memory functionality related to read and write operations by multiple processors
   –  Determines the order of read values in response to the order of write values by multiple processors
   –  Enables the writing of correct, efficient, and repeatable shared memory programs
•  Establishes a formal discipline that places restrictions on the values that can be returned by a read in a shared-memory program execution
   –  Avoids non-determinacy in memory behavior
   –  Provides a programmer perspective on expected behavior
   –  Imposes demands on system memory operation
•  Two general classes of consistency models:
   –  Sequential consistency
   –  Relaxed consistency

Page 515: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

25

Sequential Consistency Model

•  Most widely adopted memory model
•  Required:
   –  Maintaining program order among operations from individual processors
   –  Maintaining a single sequential order among operations from all processors
•  Enforces the effect of atomic complex memory operations
   –  Enables compound atomic operations
   –  Avoids race conditions
   –  Precludes non-determinacy from dueling processors

Page 516: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

26

Relaxed Consistency Models

•  Sequential consistency over-constrains parallel execution, limiting parallel performance and scalability
   –  Critical sections impose sequential bottlenecks
   –  Amdahl’s Law applies, imposing an upper bound on performance
•  Relaxed consistency models permit optimizations not possible under the limitations of sequential consistency
•  Forms of relaxed consistency
   –  Program order
      •  Write to read
      •  Write to write
      •  Read to following read or write
   –  Write atomicity
      •  Read value of its own previous write prior to it being visible to all other processors

Page 517: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

27

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 518: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Dining Philosophers Problem

28

Description:
•  N philosophers (N > 3) spend their time eating and thinking at a round table
•  There are N plates and N forks (or chopsticks, in some versions) between the plates
•  Eating requires two forks, which may be picked up one at a time, at each side of the plate
•  When any of the philosophers is done eating, he starts thinking
•  When a philosopher becomes hungry, he attempts to start eating
•  They do it in complete silence so as not to disturb each other (hence no communication to synchronize their actions is possible)

A variation on Edsger Dijkstra’s five computers competing for access to five shared tape drives problem (introduced in 1971), retold by Tony Hoare.

Problem: How must they acquire/release forks to ensure that each of them maintains a healthy balance between meditation and eating?
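One possible deadlock-free strategy is resource ordering, sketched below with Pthreads mutexes standing in for forks (illustrative code, not from the slides): philosopher i uses forks i and (i+1)%N and always locks the lower-numbered fork first, which breaks the circular wait.

#include <pthread.h>

#define N 5
pthread_mutex_t fork_mtx[N];    /* one mutex per fork, each initialized with
                                   pthread_mutex_init() before the threads start */

void *philosopher(void *arg)
{
    int i      = *(int *)arg;
    int right  = (i + 1) % N;
    int first  = (i < right) ? i : right;    /* lower-numbered fork */
    int second = (i < right) ? right : i;    /* higher-numbered fork */

    for (;;) {
        /* think ... */
        pthread_mutex_lock(&fork_mtx[first]);    /* lock in the global order */
        pthread_mutex_lock(&fork_mtx[second]);
        /* eat ... */
        pthread_mutex_unlock(&fork_mtx[second]);
        pthread_mutex_unlock(&fork_mtx[first]);
    }
    return NULL;
}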

Page 519: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

What Can Go Wrong at the Philosophers Table?

•  Deadlock If all philosophers decide to eat at the same time and pick forks at the same side of their plates, they are stuck forever waiting for the second fork.

•  Livelock Livelock frequently occurs as a consequence of a poorly thought out deadlock prevention strategy. Assume that all philosophers: (a) wait some length of time before putting down the fork they hold after noticing that they are unable to acquire the second fork, and then (b) wait some amount of time before reacquiring the forks. If they happen to get hungry at the same time and each picks up one fork as in the deadlock scenario, and all (a) and (b) timeouts are set to the same value, they won’t be able to make progress (even though there is no actual resource shortage).

•  Starvation There may be at least one philosopher unable to acquire both forks due to timing issues. For example, his neighbors may alternately keep picking up one of his forks just ahead of him, taking advantage of the fact that he is forced to put down the only fork he was able to get hold of due to the deadlock avoidance mechanism.

29

Page 520: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

30

Priority Inversion

Priority inversion is the scenario where a low priority thread holds a shared resource that is required by a high priority thread.

•  How it happens:
   –  A low priority thread locks the mutex for some shared resource
   –  A high priority thread requires access to the same resource (and waits for the mutex)
   –  In the meantime, a medium priority thread (not depending on the common resource) gets scheduled, preempting the low priority thread and thus preventing it from releasing the mutex
•  A classic occurrence of this phenomenon led to a system reset and subsequent loss of data in the Mars Pathfinder mission in 1997: http://research.microsoft.com/~mbj/Mars_Pathfinder/Mars_Pathfinder.html
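Where the platform supports the POSIX priority-inheritance protocol, the mutex guarding the shared resource can be configured so that a low-priority owner temporarily inherits the priority of any higher-priority thread blocked on it, bounding the inversion. A sketch (illustrative code, not from the slides):

#include <pthread.h>

pthread_mutexattr_t attr;
pthread_mutex_t     res_mutex;
...
pthread_mutexattr_init(&attr);
pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);  /* enable priority inheritance */
pthread_mutex_init(&res_mutex, &attr);
pthread_mutexattr_destroy(&attr);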

Suggested reading: http://en.wikipedia.org/wiki/Priority_inversion

Page 521: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

31

Spurious Wakeups

•  Spurious wakeup is a phenomenon associated with a thread waiting on a condition variable

•  In most cases, such a thread is supposed to return from call to wait() only if the condition variable has been signaled or broadcast

•  Occasionally, the waiting thread gets unblocked unexpectedly, either due to thread implementation performance trade-offs, or scheduler deficiencies

•  Lesson: upon exit from wait(), test the predicate to make sure the waiting thread indeed may proceed (i.e., the data it was waiting for has been provided). The side effect is more robust code.
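In code, the lesson amounts to the following pattern (a sketch; data_ready, cond and mutex are illustrative names): re-testing the predicate in a while loop makes a spurious return from pthread_cond_wait() harmless.

pthread_mutex_lock(&mutex);
while (!data_ready)                    /* "while", not "if" */
    pthread_cond_wait(&cond, &mutex);
/* the predicate is guaranteed to hold here; consume the data */
pthread_mutex_unlock(&mutex);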

Suggested reading: http://en.wikipedia.org/wiki/Spurious_wakeup

Page 522: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Thread Safety Code is thread-safe if it functions correctly during simultaneous execution by multiple threads.

•  Indicators helpful in determining thread safety –  How the code accesses global variables and heap –  How it allocates and frees resources that have global limits –  How it performs indirect accesses (through pointers or handles) –  Are there any visible side effects

•  Achieving thread safety

–  Re-entrancy: property of code, which may be interrupted during execution of one task, reentered to perform another, and then resumed on its original task without undesirable effects

–  Mutual exclusion: accesses to shared data are serialized to ensure that only one thread performs critical state update. Acquire locks in an identical order on all threads

–  Thread-local storage: as much of the accessed data as possible should be placed in thread’s private variables

–  Atomic operations: should be the preferred mechanism of use when operating on shared state
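As an illustration of the thread-local storage approach listed above, here is a small sketch using the Pthreads thread-specific data API (pthread_key_create() and friends); the per-thread error counter is an invented example, not from the slides.

#include <pthread.h>
#include <stdlib.h>

static pthread_key_t  err_key;
static pthread_once_t err_once = PTHREAD_ONCE_INIT;

static void make_key(void) { pthread_key_create(&err_key, free); }

/* returns a counter private to the calling thread instead of a shared global */
int *thread_error_count(void)
{
    pthread_once(&err_once, make_key);           /* create the key exactly once */
    int *count = pthread_getspecific(err_key);
    if (count == NULL) {                         /* first use by this thread */
        count = calloc(1, sizeof *count);
        pthread_setspecific(err_key, count);
    }
    return count;
}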

32

Page 523: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

33

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 524: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Common Approaches to Thread Implementation

•  Kernel threads •  User-space threads •  Hybrid implementations

34

References: 1. POSIX Threads on HP-UX 11i, http://devresource.hp.com/drc/resources/pthread_wp_jul2004.pdf 2. SunOS Multi-thread Architecture by M. L. Powell, S. R. Kleinman, et al. http://opensolaris.org/os/project/muskoka/doc_attic/mt_arch.pdf

Page 525: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Kernel Threads

•  Also referred to as Light Weight Processes
•  Known to and individually managed by the kernel
•  Can make system calls independently
•  Can run in parallel on a multiprocessor (map directly onto available execution hardware)
•  Typically have wider range of scheduling capabilities
•  Support preemptive multithreading natively
•  Require kernel support and resources
•  Have higher management overhead

35

Page 526: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

User-space Threads

•  Also known as fibers or coroutines
•  Operate on top of kernel threads, mapped to them via a user-space scheduler
•  Thread manipulations (“context switches”, etc.) are performed entirely in user space
•  Usually scheduled cooperatively (i.e., non-preemptively), complicating the application code due to the inclusion of explicit processor yield statements
•  Context switches cost less (on the order of a subroutine invocation)
•  Consume fewer resources than kernel threads; their number can consequently be much higher without imposing significant overhead
•  Blocking system calls present a challenge and may lead to inefficient processor usage (the user-space scheduler is ignorant of the occurrence of blocking; no notification mechanism exists in the kernel either)

36

Page 527: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

MxN Threading

•  Available on NetBSD, HP-UX and Solaris to complement the existing 1x1 (kernel threads only) and Mx1 (multiplexed user threads) libraries

•  Multiplex M lightweight user-space threads on top of N kernel threads, M > N (sometimes M >> N)

•  User threads are unbound and scheduled on Virtual Processors (which in turn execute on kernel threads); user thread may effectively move from one kernel thread to another in its lifetime

•  In some implementations Virtual Processors rely on the concept of Scheduler Activations to deal with the issue of user-space threads blocking during system calls

37

Page 528: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

38

Scheduler Activations
•  Developed in 1991 at the University of Washington
•  Typically used in implementations involving user-space threads
•  Require kernel cooperation in the form of a lightweight upcall mechanism to communicate blocking and unblocking events to the user-space scheduler

Reference: T. Anderson, B. Bershad, E. Lazowska and H. Levy, Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism, http://www.cs.washington.edu/homes/bershad/Papers/p53-anderson.pdf

–  Unbound user threads are scheduled on Virtual Processors (which in turn execute on kernel threads) –  A user thread may effectively move from one kernel thread to another in its lifetime –  Scheduler Activation resembles and is scheduled like a kernel thread –  Scheduler Activation provides its replacement to the user-space scheduler when the unbound thread invokes a blocking operation in the kernel –  The new Scheduler Activation continues the operations of the same VP

Page 529: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

39

Examples of Multi-Threaded System Implementations

•  The most commonly used thread package on Linux is Native POSIX Thread Library (NPTL)

–  Requires kernel version 2.6 –  1x1 model, mapping each application thread to a kernel thread –  Bundled by default with recent versions of glibc –  High-performance implementation –  POSIX (Pthreads) compliant

•  Most of the prominent operating systems feature their own thread implementations, for example:

–  FreeBSD: three thread libraries, each supporting different execution model (user-space, 1x1, MxN with scheduler activations)

–  Solaris: kernel-level execution through LWPs (Lightweight Processes); user threads execute in context of LWPs and are controlled by system library

–  HPUX: Pthreads compliant MxN implementation
–  MS Windows: threads as smallest kernel-level execution objects, fibers as smallest user-level execution objects controlled by the programmer; many-to-many scheduling supported
•  There are numerous open-source thread libraries (mostly for Linux): LinuxThreads, GNU Pth, Bare-Bone Threads, FSU Pthreads, DCEthreads, Nthreads, CLthreads, PCthreads, LWP, QuickThreads, Marcel, etc.

Page 530: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

40

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 531: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

POSIX Threads (Pthreads)
•  POSIX Threads define the POSIX standard for a multithreaded API (IEEE POSIX 1003.1-1995)
•  The functions comprising the core functionality of Pthreads can be divided into three classes:
   –  Thread management
   –  Mutexes
   –  Condition variables
•  Pthreads define the interface using C language types, function prototypes and macros
•  Naming conventions for identifiers:
   –  pthread_: Threads themselves and miscellaneous subroutines
   –  pthread_attr_: Thread attributes objects
   –  pthread_mutex_: Mutexes
   –  pthread_mutexattr_: Mutex attributes objects
   –  pthread_cond_: Condition variables
   –  pthread_condattr_: Condition attributes objects
   –  pthread_key_: Thread-specific data keys

41

References: 1. http://www.llnl.gov/computing/tutorials/pthreads/ 2. http://www.opengroup.org/onlinepubs/007908799/xsh/pthread.h.html

Page 532: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Programming with Pthreads The scope of this short tutorial is: •  General thread management •  Synchronization

–  Mutexes –  Condition variables

•  Miscellaneous functions
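One practical note not covered on the slide: with the GNU toolchain (an assumption; other compilers use different flags), the examples that follow can typically be built as gcc -pthread -o program program.c; the -pthread flag sets both the required compile options and the link library.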

42

Page 533: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

43

Pthreads: Thread Creation

Function: pthread_create()

int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*routine)(void *), void *arg); Description: Creates a new thread within a process. The created thread starts execution of routine, which is passed a pointer argument arg. The attributes of the new thread can be specified through attr, or left at default values if attr is null. Successful call returns 0 and stores the id of the new thread in location pointed to by thread, otherwise an error code is returned.

#include <pthread.h>
...
void *do_work(void *input_data) { /* this is thread’s starting routine */ ... }
...
pthread_t id;
struct {. . .} args = {. . .};   /* struct containing thread arguments */
int err;
...
/* create new thread with default attributes */
err = pthread_create(&id, NULL, do_work, (void *)&args);
if (err != 0) {/* handle thread creation failure */}
...

Page 534: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

44

Pthreads: Thread Join

Function: pthread_join()

int pthread_join(pthread_t thread, void **value_ptr);

Description: Suspends the execution of the calling thread until the target thread terminates (either by returning from its startup routine, or calling pthread_exit()), unless the target thread already terminated. If value_ptr is not null, the return value from the target thread or argument passed to pthread_exit() is made available in location pointed to by value_ptr. When pthread_join() returns successfully (i.e. with zero return code), the target thread has been terminated.

#include <pthread.h> ... void *do_work(void *args) {/* workload to be executed by thread */} ... void *result_ptr; int err; ... /* create worker thread */ pthread_create(&id, NULL, do_work, (void *)&args); ... err = pthread_join(id, &result_ptr); if (err != 0) {/* handle join error */} else {/* the worker thread is terminated and result_ptr points to its return value */ ... }

Page 535: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

45

Pthreads: Thread Exit Function: pthread_exit()

void pthread_exit(void *value_ptr);

Description: Terminates the calling thread and makes the value_ptr available to any successful join with the terminating thread. Performs cleanup of local thread environment by calling cancellation handlers and data destructor functions. Thread termination does not release any application visible resources, such as mutexes and file descriptors, nor does it perform any process-level cleanup actions.

#include <pthread.h>
...
void *do_work(void *args) {
  ...
  pthread_exit(&return_value);
  /* the code following pthread_exit is not executed */
  ...
}
...
void *result_ptr;
pthread_t id;
pthread_create(&id, NULL, do_work, (void *)&args);
...
pthread_join(id, &result_ptr);
/* result_ptr now points to return_value */
...

Page 536: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

46

Pthreads: Thread Termination

Function: pthread_cancel()

int pthread_cancel(pthread_t thread);

Description: The pthread_cancel() requests cancellation of thread thread. The ability to cancel a thread is dependent on its state and type.

#include <pthread.h> ... void *do_work(void *args) {/* workload to be executed by thread */} ... pthread_t id; int err; pthread_create(&id, NULL, do_work, (void *)&args); ... err = pthread_cancel(id); if (err != 0) {/* handle cancelation failure */} ...

Page 537: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

47

Pthreads: Detached Threads

Function: pthread_detach()

int pthread_detach(pthread_t thread);

Description: Indicates to the implementation that storage for thread thread can be reclaimed when the thread terminates. If the thread has not terminated, pthread_detach() is not going to cause it to terminate. Returns zero on success, error number otherwise.

#include <pthread.h> ... void *do_work(void *args) {/* workload to be executed by thread */} ... pthread_t id; int err; ... /* start a new thread */ pthread_create(&id, NULL, do_work, (void *)&args); ... err = pthread_detach(id); if (err != 0) {/* handle detachment failure */} else {/* master thread doesn’t join the worker thread; the worker thread resources will be released automatically after it terminates */ ... }

Page 538: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

48

Pthreads: Operations on Mutex Objects (I)

#include <pthread.h> ... pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; ... /* lock the mutex before entering critical section */ pthread_mutex_lock(&mutex); /* critical section code */ ... /* leave critical section and release the mutex */ pthread_mutex_unlock(&mutex); ...

Function: pthread_mutex_lock(), pthread_mutex_unlock()

int pthread_mutex_lock(pthread_mutex_t *mutex); int pthread_mutex_unlock(pthread_mutex_t *mutex); Description: The mutex object referenced by mutex shall be locked by calling pthread_mutex_lock(). If the mutex is already locked, the calling thread blocks until the mutex becomes available. After successful return from the call, the mutex object referenced by mutex is in locked state with the calling thread as its owner. The mutex object referenced by mutex is released by calling pthread_mutex_unlock(). If there are threads blocked on the mutex, scheduling policy decides which of them shall acquire the released mutex.

Page 539: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

49

Pthreads: Operations on Mutex Objects (II)

#include <pthread.h> ... pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; int err; ... /* attempt to lock the mutex */ err = pthread_mutex_trylock(&mutex); switch (err) { case 0: /* lock acquired; execute critical section code and release mutex */ ... pthread_mutex_unlock(&mutex); break; case EBUSY: /* someone already owns the mutex; do something else instead of blocking */ ... break; default: /* some other failure */ ... break; }

Function: pthread_mutex_trylock()

int pthread_mutex_trylock(pthread_mutex_t *mutex);

Description: The function pthread_mutex_trylock() is equivalent to pthread_mutex_lock() , except that if the mutex object is currently locked, the call returns immediately with an error code EBUSY. The value of 0 (success) is returned only if the mutex has been acquired.

Page 540: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Pthread Mutex Types

•  Normal
   –  No deadlock detection on attempts to relock an already locked mutex
•  Error-checking
   –  An error is returned when locking an already locked mutex
•  Recursive (see the sketch after this list)
   –  Maintains a lock count variable
   –  After the first acquisition of the mutex, the lock count is set to one
   –  After each successful relock, the lock count is increased; after each unlock, it is decremented
   –  When the lock count drops to zero, the thread loses the mutex ownership
•  Default
   –  Attempts to lock the mutex recursively result in undefined behavior
   –  Attempts to unlock a mutex which is not locked, or was not locked by the calling thread, result in undefined behavior
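A brief sketch of how a non-default type is selected in Pthreads (illustrative code, not from the slides), using a mutex attributes object and pthread_mutexattr_settype():

#include <pthread.h>

pthread_mutexattr_t attr;
pthread_mutex_t     rec_mutex;
...
pthread_mutexattr_init(&attr);
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
pthread_mutex_init(&rec_mutex, &attr);
pthread_mutexattr_destroy(&attr);
...
pthread_mutex_lock(&rec_mutex);      /* lock count = 1 */
pthread_mutex_lock(&rec_mutex);      /* relock by the same thread: count = 2 */
pthread_mutex_unlock(&rec_mutex);    /* count = 1 */
pthread_mutex_unlock(&rec_mutex);    /* count = 0: ownership released */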

50

Page 541: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

51

Pthreads: Condition Variables

Function: pthread_cond_wait(), pthread_cond_signal(), pthread_cond_broadcast()

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_signal(pthread_cond_t *cond);
int pthread_cond_broadcast(pthread_cond_t *cond);

Description: pthread_cond_wait() blocks on a condition variable associated with a mutex. The function must be called with a locked mutex argument. It atomically releases the mutex and causes the calling thread to block. While in that state, another thread is permitted to acquire the mutex; the state change should subsequently be announced by that thread through pthread_cond_signal() or pthread_cond_broadcast(). Upon successful return from pthread_cond_wait(), the mutex is in locked state with the calling thread as its owner. pthread_cond_signal() unblocks at least one of the threads that are blocked on the specified condition variable cond. pthread_cond_broadcast() unblocks all threads currently blocked on the specified condition variable cond. All of these functions return zero on successful completion, or an error code otherwise.

Page 542: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

52

Example: Condition Variable

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;  /* create default mutex */
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;     /* create default condition variable */
pthread_t prod_id, cons_id;
item_t buffer;                                      /* storage buffer (shared access) */
int empty = 1;                                      /* buffer empty flag (shared access) */
...
pthread_create(&prod_id, NULL, producer, NULL);     /* start producer thread */
pthread_create(&cons_id, NULL, consumer, NULL);     /* start consumer thread */
...

void *producer(void *none) {
  while (1) {
    /* obtain next item, asynchronously */
    item_t item = compute_item();
    pthread_mutex_lock(&mutex);        /* critical section starts here */
    while (!empty)                     /* wait until buffer is empty */
      pthread_cond_wait(&cond, &mutex);
    /* store item, update status */
    buffer = item;
    empty = 0;
    /* wake waiting consumer (if any) */
    pthread_cond_signal(&cond);
    /* critical section done */
    pthread_mutex_unlock(&mutex);
  }
}

void *consumer(void *none) {
  while (1) {
    item_t item;
    pthread_mutex_lock(&mutex);        /* critical section starts here */
    while (empty)                      /* block (nothing in buffer yet) */
      pthread_cond_wait(&cond, &mutex);
    /* grab item, update buffer status */
    item = buffer;
    empty = 1;
    /* critical section done */
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mutex);
    /* process item, asynchronously */
    consume_item(item);
  }
}

Initialization and startup

Simple producer thread    Simple consumer thread

Page 543: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

53

Pthreads: Dynamic Initialization

Function: pthread_once()

int pthread_once(pthread_once_t *control, void (*init_routine)(void));

Description: The first call to pthread_once() by any thread in a process will call the init_routine() with no arguments. Subsequent calls to pthread_once() with the same control will not call init_routine().

#include <pthread.h>
...
pthread_once_t init_ctrl = PTHREAD_ONCE_INIT;
...
void initialize() {/* initialize global variables */}
...
void *do_work(void *arg) {
  /* make sure global environment is set up */
  pthread_once(&init_ctrl, initialize);
  /* start computations */
  ...
}
...
pthread_t id;
pthread_create(&id, NULL, do_work, NULL);
...

Page 544: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

54

Pthreads: Get Thread ID

Function: pthread_self()

pthread_t pthread_self(void);

Description: Returns the thread ID of the calling thread.

#include <pthread.h> ... pthread_t id; id = pthread_self(); ...

Page 545: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

55

Topics

•  Introduction •  Performance: CPI and memory behavior •  Overview of threaded execution model •  Programming with threads: basic concepts •  Shared memory consistency models •  Pitfalls of multithreaded programming •  Thread implementations: approaches and issues •  Pthreads: concepts and API •  Summary

Page 546: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

Summary – Material for the Test

•  Performance & cpi: slide 8 •  Multi thread concepts: 13, 16, 18, 19, 22, 24, 31 •  Thread implementations: 35 – 37 •  Pthreads: 43 – 45, 48

56  

Page 547: Paralel Computing

CSC  7600  Lecture  11  :  Pthreads    Spring  2011  

57  

Page 548: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &

MEANS

OPENMP

Prof. Thomas Sterling

Department of Computer Science

Louisiana State University

February 24, 2011

Page 549: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

2

Page 550: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

3

Page 551: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Where are we? (Take a deep breath …)

• 3 classes of parallel/distributed computing

– Capacity

– Capability

– Cooperative

• 3 classes of parallel architectures (respectively)

– Loosely coupled clusters and workstation farms

– Tightly coupled vector, SIMD, SMP

– Distributed memory MPPs (and some clusters)

• 3 classes of parallel execution models (respectively)

– Workflow, throughput, SPMD (ssh)

– Multithreaded with shared memory semantics (Pthreads)

– Communicating Sequential Processes (sockets)

• 3 classes of programming models

– Condor (Segment 1)

– OpenMP (Segment 3)

– MPI (Segment 2)

You Are Here

4

Page 552: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

HPC Modalities

5

Modalities Degree of Integration

Architectures Execution Models

Programming Models

Capacity Loosely Coupled Clusters & Workstation farms

Workflow Throughput

Condor

Capability Tightly Coupled Vectors, SMP, SIMD

Shared Memory Multithreading

OpenMP

Cooperative Medium DM MPPs & Clusters

CSP MPI

Page 553: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

6

Page 554: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

7

Amdahl’s Law

Definitions:

   T_O : time for the non-accelerated computation
   T_F : time of the portion of the computation that can be accelerated
   T_A : time for the accelerated computation
   g   : peak performance gain for the accelerated portion of the computation
   f   : fraction of the non-accelerated computation to be accelerated, f = T_F / T_O
   S   : speed up of the computation with acceleration applied, S = T_O / T_A

   T_A = T_O - T_F + T_F / g = T_O * (1 - f + f / g)

   S = T_O / T_A = 1 / (1 - f + f / g)

(Timeline figure: the original run of length T_O contains the accelerable portion T_F; in the accelerated run of length T_A that portion shrinks to T_F / g.)
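As a quick worked example (not on the original slide): if 90% of the computation can be accelerated (f = 0.9) by a peak factor of g = 10, then S = 1 / (0.1 + 0.9/10) = 1 / 0.19 ≈ 5.3, well below the factor of 10 applied to the accelerated portion.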

Page 555: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Performance : Caches & Locality

• Temporal Locality is a property that if a program accesses a

memory location, there is a much higher than random probability

that the same location would be accessed again.

• Spatial Locality is a property that if a program accesses a

memory location, there is a much higher than random probability

that the nearby locations would be accessed soon.

• Spatial locality is usually easier to achieve than temporal locality

• A couple of key factors affect the relationship between locality

and scheduling :

– Size of dataset being processed by each processor

– How much reuse is present in the code processing a chunk of

iterations.

8

Page 556: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Performance Shared Memory (OpenMP): Key

Factors

• Load Balancing :

– mapping workloads with thread scheduling

• Caches :

– Write-through

– Write-back

• Locality :

– Temporal Locality

– Spatial Locality

• How Locality affects scheduling algorithm selection

• Synchronization :

– Effect of critical sections on performance

9

Page 557: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Performance : Caches & Locality

• Caches (Review) :

– for a C statement :

• a[i] = b[i]+c[i]

– the system fetches the contents of the memory locations referenced by b[i] and c[i] into the processor; the result of the computation is subsequently stored in the memory location referenced by a[i]

• Write-through caches: When a user writes some data, the data is immediately

written back to the memory, thus maintaining the cache-memory consistency.

In write through caches data in caches always reflect the data in the memory.

One of the main issues in write through caches is the increase in system

overhead required due to moving of large data between cache and memory.

• Write-back caches : When a user writes some data, the data is stored in the

cache and is not synchronized with the memory. Instead when the cache

content is different than the memory content, a bit entry is made in the cache.

While cleaning up caches the system checks for the entry in cache and if the

bit is set the system writes the changes to the memory.

10

Page 558: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

11

Page 559: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Introduction

• OpenMP is :– an API (Application Programming Interface)

– NOT a programming language

– A set of compiler directives that help the application developer to parallelize their workload.

– A collection of the directives, environment variables and the library routines

• OpenMP is composed of the following main components : – Directives

– Runtime library routines

– Environment variables

12

Page 560: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Components of OpenMP

Environment variables

Number of threads

Scheduling type

Dynamic thread adjustment

Nested Parallelism

13

Directives

Parallel regions

Work sharing

Synchronization

Data scope attributes: • private • firstprivate • lastprivate • shared • reduction

Orphaning

Runtime library routines

Number of threads

Thread ID

Dynamic thread adjustment

Nested Parallelism

Timers

API for locking

Page 561: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP Architecture

14

Operating System level Threads

OpenMP Runtime Library

Application

Environment Variables

User

Compiler Directives

Inspired by OpenMP.org introductory slides

Page 562: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

15

Page 563: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Runtime Library Routines

• Runtime library routines help manage parallel programs

• Many runtime library routines have corresponding environment

variables that can be controlled by the users

• Runtime libraries can be accessed by including omp.h in

applications that use OpenMP : #include <omp.h>

• For example for calls like :

– omp_get_num_threads(), (by which an openMP program determines

the number of threads available for execution) can be controlled using

an environment variable set at the command-line of a shell

($OMP_NUM_THREADS)

• Some of the activities that the OpenMP libraries help manage are :

– Determining the number of threads/processors

– Scheduling policies to be used

– General purpose locking and portable wall clock timing routines

16

Page 564: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

17

OpenMP : Runtime Library

Function: omp_get_num_threads()

C/ C++ int omp_get_num_threads(void);

Fortran integer function omp_get_num_threads()

Description:

Returns the total number of threads currently in the group executing the parallel

block from where it is called.

Function: omp_get_thread_num()

C/ C++ int omp_get_thread_num(void);

Fortran integer function omp_get_thread_num()

Description:

For the master thread, this function returns zero. For the other threads in the team, the call returns an integer between 1 and omp_get_num_threads()-1 inclusive.

Page 565: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP Environment Variables

• OpenMP provides 4 main environment variables for

controlling execution of parallel codes:

OMP_NUM_THREADS – controls the parallelism of the

OpenMP application

OMP_DYNAMIC – enables dynamic adjustment of number of

threads for execution of parallel regions

OMP_SCHEDULE – controls the load distribution in loops such

as do, for

OMP_NESTED – Enables nested parallelism in OpenMP

applications

18

Page 566: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP Environment Variables

19

Environment

Variable:

OMP_NUM_THREADS

Usage :

bash/sh/ksh:

csh/tcsh

OMP_NUM_THREADS n

export OMP_NUM_THREADS=8

setenv OMP_NUM_THREADS 8

Description:

Sets the number of threads to be used by the OpenMP program during execution.

Environment

Variable:

OMP_DYNAMIC

Usage :

bash/sh/ksh:

csh/tcsh

OMP_DYNAMIC {TRUE|FALSE}

export OMP_DYNAMIC=TRUE

setenv OMP_DYNAMIC "TRUE"

Description:

When this environment variable is set to TRUE, the runtime may dynamically adjust the number of threads used for parallel regions, with the maximum number of threads given by $OMP_NUM_THREADS.

Page 567: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP Environment Variables

20

Environment

Variable:

OMP_SCHEDULE

Usage :

bash/sh/ksh:

csh/tcsh

OMP_SCHEDULE “schedule,[chunk]”

export OMP_SCHEDULE="static,N/P"

setenv OMP_SCHEDULE "GUIDED,4"

Description:

Only applies to for and parallel for directives. This environment variable sets the

schedule type and chunk size for all such loops. The chunk size can be provided as an

integer number, the default being 1.

Environment

Variable:

OMP_NESTED

Usage :

bash/sh/ksh:

csh/tcsh

OMP_NESTED {TRUE|FALSE}

export OMP_NESTED=FALSE

setenv OMP_NESTED FALSE

Description:

Setting this environment variable to TRUE enables multi-threaded execution of inner

parallel regions in nested parallel regions.

Page 568: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP : Basic Constructs

C / C++ :

#pragma omp parallel {

parallel block

} /* omp end parallel */

21

OpenMP Execution Model (FORK/JOIN):

Sequential Part (master thread)

Parallel Region (FORK : group of threads)

Sequential Part (JOIN: master thread)

Parallel Region (FORK: group of threads)

Sequential Part (JOIN : master thread)

To invoke library routines in C/C++ add

#include <omp.h> near the top of your code

Page 569: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

HelloWorld in OpenMP

22

#include <omp.h>
#include <stdio.h>   /* needed for printf() */

int main () {
  int nthreads, tid;

#pragma omp parallel private(nthreads, tid)
  {
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  }
  return 0;
}

Code segment that will be executed in parallel

OpenMP directive to indicate START segment to be parallelized

OpenMP directive to indicate END segment to be parallelized

Non shared copies of data for each thread

Page 570: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP Execution

• On encountering the C construct #pragma omp parallel{, n-1 extra threads are created

• omp_get_thread_num() returns a unique identifier for each thread that can be utilized. The value returned by this call is between 0 and (OMP_NUM_THREADS – 1)

• omp_get_num_threads() returns the total number of threads involved in the parallel section of the program

• Code after the parallel directive is executed independently on each of the nthreads.

• On encountering the closing brace } (corresponding to #pragma omp parallel {), parallel execution of the code segment ends: the n-1 extra threads are deactivated and normal sequential execution resumes.

23

Page 571: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Compiling OpenMP Programs

Fortran :

• Case insensitive directives

• Syntax :

– !$OMP directive [clause[[,] clause]…] (free format)

– !$OMP / C$OMP / *$OMP directive [clause[[,] clause]…] (fixed format)

• Compiling OpenMP source code :

– (GNU Fortran compiler) : gfortran -fopenmp -o exec_name file_name.f95

– (Intel Fortran compiler) : ifort -o exe_file_name -openmp file_name.f

24

C :

• Case sensitive directives

• Syntax :

– #pragma omp directive [clause [clause]..]

• Compiling OpenMP source code :

– (GNU C compiler) : gcc -fopenmp -o exec_name file_name.c

– (Intel C compiler) : icc -o exe_file_name -openmp file_name.c

Page 572: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

DEMO : Hello World

25

Page 573: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

26

Page 574: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP : Data Environment

• OpenMP program always begins with a single thread of control – master

thread

• Context associated with the master thread is also known as the Data

Environment.

• Context is comprised of :

– Global variables

– Automatic variables

– Dynamically allocated variables

• Context of the master thread remains valid throughout the execution of the

program

• The OpenMP parallel construct may be used to either share a single copy of

the context with all the threads or provide each of the threads with a private

copy of the context.

• The sharing of Context can be performed at various levels of granularity

– Select variables from a context can be shared while keeping the context private

etc.

27

Page 575: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP Data Environment

• OpenMP data scoping clauses allow a programmer to decide a variable’s execution context (should a variable be shared or private.)

• 3 main data scoping clauses in OpenMP (Shared, Private, Reduction) :

• Shared :

– A variable will have a single storage location in memory for the duration of the parallel construct, i.e. references to a variable by different threads access the same memory location.

– That part of the memory is shared among the threads involved, hence modifications to the variable can be made using simple read/write operations

– Modifications to the variable by different threads is managed by underlying shared memory mechanisms

• Private :

– A variable will have a separate storage location in memory for each of the threads involved for the duration of the parallel construct.

– All read/write operations by the thread will affect the thread’s private copy of the variable .

• Reduction :

– Exhibit both shared and private storage behavior. Usually used on objects that are the target of arithmetic reduction.

– Example : summation of local variables at the end of a parallel construct

28

Page 576: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP Work-Sharing Directives• Work sharing constructs divide the execution of the

enclosed block of code among the group of threads.

• They do not launch new threads.

• No implied barrier on entry

• Implicit barrier at the end of work-sharing construct

• Commonly used Work Sharing constructs :

– for directive (C/C++ ; equivalent DO construct available in

Fortran but will not be covered here) : shares iterations of a

loop across a group of threads

– sections directive : breaks work into separate sections

between the group of threads; such that each thread

independently executes a section of the work.

– critical directive: serializes a section of code

29

Page 577: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP schedule clause

• The schedule clause defines how the iterations of a loop are divided

among a group of threads

• static : iterations are divided into pieces of size chunk and are

statically assigned to each of the threads in a round robin fashion

• dynamic : iterations divided into pieces of size chunk and

dynamically assigned to a group of threads. After a thread finishes

processing a chunk, it is dynamically assigned the next set of

iterations.

• guided : for a chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads, decreasing to 1. For a chunk size of k, the same algorithm is used for determining the chunk size, with the constraint that no chunk contains fewer than k iterations (except possibly the last chunk). An example of all three schedule kinds is sketched after this list.

• Default schedule is implementation specific while the default chunk

size is usually 1
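A small sketch of how these clauses look in code (not from the slides; a, b, c, n and the chunk size of 4 are illustrative):

/* the same loop under the three schedule kinds, each with a chunk size of 4 */
#pragma omp parallel for schedule(static, 4)
for (i = 0; i < n; i++) c[i] = a[i] + b[i];   /* fixed blocks of 4, assigned round robin */

#pragma omp parallel for schedule(dynamic, 4)
for (i = 0; i < n; i++) c[i] = a[i] + b[i];   /* blocks of 4 handed out on demand */

#pragma omp parallel for schedule(guided, 4)
for (i = 0; i < n; i++) c[i] = a[i] + b[i];   /* shrinking blocks, never below 4 (except possibly the last) */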

30

Page 578: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP for directive

• for directive helps share iterations of a loop

between a group of threads

• If nowait is specified then the threads do not wait

for synchronization at the end of a parallel loop

• The schedule clause describes how iterations of

a loop are divided among the threads in the team

(discussed in detail in the next few slides)

31

#pragma omp parallel

{

p=5;

#pragma omp for

for (i=0; i<24; i++)

x[i]=y[i]+p*(i+3);

} /* omp end parallel */

(Diagram: fork; each thread sets p=5 and executes its share of the for loop iterations, e.g. i=0..4, i=5..9, ..., computing x[i]=y[i]+…; join.)

Page 579: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Simple Loop Parallelization

#pragma omp parallel for

for (i=0; i<n; i++)

z[i] = a*x[i]+y;

32

Master thread executing serial portion of the code

Master thread encounters parallel for loop and creates worker threads

Master and worker threads divide iterations of the for loop and execute them concurrently

Implicit barrier: wait for all threads to finish their executions

Master thread executing serial portion of the code resumes and slave threads are discarded

Page 580: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Example: OpenMP work sharing

Constructs

33

#include <omp.h>#define N 16main (){int i, chunk;float a[N], b[N], c[N];for (i=0; i < N; i++)a[i] = b[i] = i * 1.0;

chunk = 4;printf("a[i] + b[i] = c[i] \n");#pragma omp parallel shared(a,b,c,chunk) private(i){#pragma omp for schedule(dynamic,chunk) nowaitfor (i=0; i < N; i++)c[i] = a[i] + b[i];

} /* end of parallel section */for (i=0; i < N; i++)

printf(" %f + %f = %f \n",a[i],b[i],c[i]);}

Initializing the vectors a[i], b[i]

Instructing the runtime environment that a,b,c,chunk are shared variables and I is a private variable

Load balancing the threads using a DYNAMIC policy where array is divided into chunks of 4 and assigned to the threads

The nowait clause ensures that the child threads do not synchronize once their work is completed

Modified from examples posted on: https://computing.llnl.gov/tutorials/openMP/

Page 581: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

DEMO : Work Sharing Constructs :

Shared / Private / Schedule

• Vector addition problem to be used

• Two vectors a[i] + b[i] = c[i] a[i] + b[i] = c[i]

0.000000 + 0.000000 = 0.000000

1.000000 + 1.000000 = 2.000000

2.000000 + 2.000000 = 4.000000

3.000000 + 3.000000 = 6.000000

4.000000 + 4.000000 = 8.000000

5.000000 + 5.000000 = 10.000000

6.000000 + 6.000000 = 12.000000

7.000000 + 7.000000 = 14.000000

8.000000 + 8.000000 = 16.000000

9.000000 + 9.000000 = 18.000000

10.000000 + 10.000000 = 20.000000

11.000000 + 11.000000 = 22.000000

12.000000 + 12.000000 = 24.000000

13.000000 + 13.000000 = 26.000000

14.000000 + 14.000000 = 28.000000

15.000000 + 15.000000 = 30.000000

34

Page 582: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP sections directive
• The sections directive is a non-iterative work sharing construct.
• Independent sections of code are nested within a sections directive
• It distributes the enclosed sections of code among the group of threads
• Code enclosed within a section directive is executed by one thread from the pool of threads

35

#pragma omp parallel private(p)

{

#pragma omp sections

{{ a=…;

b=…;}

#pragma omp section

{ p=…;

q=…;}

#pragma omp section

{ x=…;

y=…;}

} /* omp end sections */

} /* omp end parallel */

(Diagram: fork; the three sections {a=…; b=…}, {p=…; q=…}, and {x=…; y=…} are executed concurrently by different threads; join.)

Page 583: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Understanding variables in OpenMP

• Shared variable z is modified by multiple threads

• Each iteration reads the scalar variables a and y

and the array element x[i]

• a,y,x can be read concurrently as their values

remain unchanged.

• Each iteration writes to a distinct element of z[i]

over the index range. Hence write operations can

be carried out concurrently with each iteration

writing to a distinct array index and memory

location

• The parallel for directive in OpenMP ensures that

the for loop index value (i in this case) is private to

each thread.

36

(Diagram: the loop index i is private to each thread, while z[ ], a, x[ ], y, and n are shared.)

#pragma omp parallel for

for (i=0; i<n; i++)

z[i] = a*x[i]+y

Page 584: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Example : OpenMP Sections

37

#include <omp.h>#define N 16main (){int i;float a[N], b[N], c[N], d[N];for (i=0; i < N; i++)

a[i] = b[i] = i * 1.5;#pragma omp parallel shared(a,b,c,d) private(i){#pragma omp sections nowait

{#pragma omp sectionfor (i=0; i < N; i++)c[i] = a[i] + b[i];

#pragma omp sectionfor (i=0; i < N; i++)d[i] = a[i] * b[i];

} /* end of sections */} /* end of parallel section */…

Section : that computes the sum of the 2 vectors

Section : that computes the product of the 2 vectors

Sections construct that encloses the section calls

Modified from examples posted on: https://computing.llnl.gov/tutorials/openMP/

Page 585: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

DEMO : OpenMP Sections

38

[LSU760000@n00 l12]$ ./sections a[i] b[i] a[i]+b[i] a[i]*b[i] 0.000000 0.000000 0.000000 0.000000 1.500000 1.500000 3.000000 2.250000 3.000000 3.000000 6.000000 9.000000 4.500000 4.500000 9.000000 20.250000 6.000000 6.000000 12.000000 36.000000 7.500000 7.500000 15.000000 56.250000 9.000000 9.000000 18.000000 81.000000 10.500000 10.500000 21.000000 110.250000 12.000000 12.000000 24.000000 144.000000 13.500000 13.500000 27.000000 182.250000 15.000000 15.000000 30.000000 225.000000 16.500000 16.500000 33.000000 272.250000 18.000000 18.000000 36.000000 324.000000 19.500000 19.500000 39.000000 380.250000 21.000000 21.000000 42.000000 441.000000 22.500000 22.500000 45.000000 506.250000

Page 586: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

39

Page 587: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Thread Synchronization

• “communication” mainly through read write operations on shared

variables

• Synchronization defines the mechanisms that help in coordinating

execution of multiple threads (that use a shared context) in a parallel

program.

• Without synchronization, multiple threads accessing shared memory

location may cause conflicts by :

– Simultaneously attempting to modify the same location

– One thread attempting to read a memory location while another thread is

updating the same location.

• Synchronization helps by providing explicit coordination between

multiple threads.

• Two main forms of synchronization :

– Implicit event synchronization

– Explicit synchronization – critical, master directives in OpenMP

40

Page 588: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Basic Types of Synchronization

• Explicit Synchronization via mutual exclusion

– Controls access to the shared variable by providing a thread exclusive

access to the memory location for the duration of its construct.

– Critical directive of OpenMP provides mutual exclusion

• Event Synchronization

– Signals occurrence of an event across multiple threads.

– Barrier directives in OpenMP provide the simplest form of event

synchronization

– The barrier directive defines a point in a parallel program where each

thread waits for all other threads to arrive. This helps to ensure that all

threads have executed the same code in parallel up to the barrier.

– Once all threads arrive at the point, the threads can continue execution

past the barrier.

• Additional synchronization mechanisms available in OpenMP
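A minimal sketch of barrier-based event synchronization (phase1() and phase2() are illustrative placeholders, not from the slides):

#pragma omp parallel
{
    phase1();              /* every thread works on phase 1 */
#pragma omp barrier        /* no thread starts phase 2 until all have finished phase 1 */
    phase2();
}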

41

Page 589: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP Synchronization : master

• The master directive in OpenMP marks a block of code that gets

executed on a single thread.

• The rest of the threads in the group ignore the portion of code

marked by the master directive

• Example

#pragma omp master structured block

42

Race Condition :

Two asynchronous threads access the same shared variable, at least one of them modifies the variable, and the sequence of operations is undefined. The result of these asynchronous operations depends on the detailed timing of the individual threads of the group.

Page 590: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP critical directive :

Explicit Synchronization

• Race conditions can be avoided by controlling access to shared variables by allowing threads to have exclusive access to the variables

• Exclusive access to shared variables allows the thread to atomically perform read, modify and update operations on the variable.

• Mutual exclusion synchronization is provided by the critical directive of OpenMP

• Code block within the critical region defined by critical /end critical directives can be executed only by one thread at a time.

• Other threads in the group must wait until the current thread exits the critical region. Thus only one thread can manipulate values in the critical region.

43

fork

join

- critical region

int x

x=0;

#pragma omp parallel shared(x)

{

#pragma omp critical

x = 2*x + 1;

} /* omp end parallel */

Page 591: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Simple Example : critical

44

cnt = 0;

f = 7;

#pragma omp parallel

{

#pragma omp for

for (i=0;i<20;i++){

if(b[i] == 0){

#pragma omp critical

cnt ++;

} /* end if */

a[i]=b[i]+f*(i+1);

} /* end for */

} /* omp end parallel */

(Diagram: with cnt=0 and f=7, the fork gives each thread a block of iterations (i=0..4, i=5..9, i=10..14, i=15..19); the cnt++ updates inside the critical region are serialized across threads, while the a[i]=b[i]+… updates proceed in parallel; join.)

Page 592: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

45

Page 593: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

OpenMP : Reduction

• performs reduction on shared variables in list based on the operator provided.

• for C/C++ operator can be any one of :

– +, *, -, ^, |, ||, & or &&

– At the end of a reduction, the shared variable contains the result obtained upon

combination of the list of variables processed using the operator specified.

46

sum = 0.0

#pragma omp parallel for reduction(+:sum)

for (i=0; i < 20; i++)

sum = sum + (a[i] * b[i]);

(Diagram: sum is initialized to 0; each thread accumulates a partial sum over its block of iterations (i=0..4, i=5..9, i=10..14, i=15..19); at the end of the region the partial sums are combined (∑) into the shared sum.)

Page 594: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Example: Reduction

47

#include <omp.h>main () {int i, n, chunk;float a[16], b[16], result;n = 16;chunk = 4;result = 0.0;for (i=0; i < n; i++){

a[i] = i * 1.0;b[i] = i * 2.0;

}#pragma omp parallel for default(shared) private(i) \

schedule(static,chunk) reduction(+:result)for (i=0; i < n; i++)

result = result + (a[i] * b[i]);printf("Final result= %f\n",result);}

Reduction example with summation where the result of the reduction operation stores the dotproduct of two vectors

∑a[i]*b[i]

SRC : https://computing.llnl.gov/tutorials/openMP/

Page 595: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Demo: Dot Product using Reduction

48

[LSU760000@n00 l12]$ ./reduction a[i] b[i] a[i]*b[i]0.000000 0.000000 0.0000001.000000 2.000000 2.0000002.000000 4.000000 8.0000003.000000 6.000000 18.0000004.000000 8.000000 32.0000005.000000 10.000000 50.0000006.000000 12.000000 72.0000007.000000 14.000000 98.0000008.000000 16.000000 128.0000009.000000 18.000000 162.00000010.000000 20.000000 200.00000011.000000 22.000000 242.00000012.000000 24.000000 288.00000013.000000 26.000000 338.00000014.000000 28.000000 392.00000015.000000 30.000000 450.000000Final result= 2480.000000

Page 596: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

49

Page 597: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Synopsis of Commands

• How to invoke OpenMP runtime systems: #pragma omp parallel

• The interplay between OpenMP environment variables and the runtime system (omp_get_num_threads(), omp_get_thread_num())

• Shared data directives such as shared, private and reduction

• Basic flow control using sections, for

• Fundamentals of synchronization using the critical directive and critical sections

• And directives used for the OpenMP programming part of the problem set.

50

Page 598: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Topics

• Review of HPC Models

• Shared Memory: Performance concepts

• Introduction to OpenMP

• OpenMP: Runtime Library & Environment Variables

• OpenMP: Data & Work sharing directives

• OpenMP: Synchronization

• OpenMP: Reduction

• Synopsis of Commands

• Summary Materials for Test

51

Page 599: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

Summary – Material for Test

• HPC Modalities – 4,5

• Performance issues in shared memory programming – 7,

8, 9, 10

• OpenMP runtime library routines – 16, 17

• OpenMP environment variables – 18, 19, 20

• OpenMP data environment 27, 28

• OpenMP work sharing directives – 29, 30, 31, 35, 36

• OpenMP thread synchronization – 40, 41, 42, 43

• OpenMP reduction 46

52

Page 600: Paralel Computing

CSC 7600 Lecture 12 : OpenMPSpring 2011

53

Page 601: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &

MEANS

APPLIED PARALLEL ALGORITHMS 1

Prof. Thomas Sterling
Dr. Hartmut Kaiser
Department of Computer Science
Louisiana State University
March 10, 2011

Page 602: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Dr. Hartmut Kaiser

Center for Computation & Technology

R315 Johnston

[email protected]

2

Page 603: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Puzzle of the Day

• What’s the difference between the following valid C

function declarations:

void foo();
void foo(void);
void foo(…);

Page 604: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Puzzle of the Day

• What's the difference between the following valid C function declarations:

void foo();      any number of parameters
void foo(void);  no parameters
void foo(…);     any number of parameters

• What's the difference between the following valid C++ function declarations:

void foo();
void foo(void);
void foo(…);

Page 605: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Puzzle of the Day

• What’s the difference between the following valid C

function declarations:

void foo();      any number of parameters
void foo(void);  no parameters
void foo(…);     any number of parameters

• What’s the difference between the following valid C++ function declarations:

void foo();      no parameters
void foo(void);  no parameters
void foo(…);     any number of parameters

Page 606: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

6

Topics

• Introduction

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Vector Dot-Product

• Matrix Multiplication

Page 607: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

7

Topics

• Introduction

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Vector Dot-Product

• Matrix Multiplication

Page 608: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

8

Parallel Programming

• Goals

– Correctness

– Reduction in execution time

– Efficiency

– Scalability

– Increased problem size and richness of models

• Objectives

– Expose parallelism

• Algorithm design

– Distribute work uniformly

• Data decomposition and allocation

• Dynamic load balancing

– Minimize overhead of synchronization and communication

• Coarse granularity

• Big messages

– Minimize redundant work

• Still sometimes better than communication

Page 609: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

9

Basic Parallel (MPI) Program Steps

• Establish logical bindings

• Initialize application execution environment

• Distribute data and work

• Perform core computations in parallel (across nodes)

• Synchronize and exchange intermediate data results
– Optional for non-embarrassingly parallel (cooperative) computations

• Detect "stop" condition
– Maybe implicit with a barrier, etc.

• Aggregate final results
– Often a reduction operator

• Output results and error code

• Terminate and return to OS
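
As a concrete illustration of these steps, here is a minimal MPI skeleton (a sketch added for this write-up, not part of the original lecture code); the array size and the equal-slice work split are placeholder choices for the example:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  /* Initialize application execution environment */
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* establish logical bindings */
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* Distribute data and work: each rank takes an equal slice (placeholder scheme) */
  const int n = 1000000;
  int chunk = n / size;
  int start = rank * chunk, end = start + chunk;

  /* Perform core computations in parallel (across nodes) */
  double local = 0.0;
  for (int i = start; i < end; i++)
    local += (double)i;

  /* Aggregate final results: a reduction operator */
  double total = 0.0;
  MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  /* Output results, then terminate and return to the OS */
  if (rank == 0)
    printf("total = %f\n", total);
  MPI_Finalize();
  return 0;
}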

Page 610: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

10

“embarrassingly parallel”

• Common phrase

– poorly defined,

– widely used

• Suggests lots and lots of parallelism

– with essentially no inter task communication or coordination

– Highly partitionable workload with minimal overhead

• “almost embarrassingly parallel”

– Same as above, but

– Requires master to launch many tasks

– Requires master to collect final results of tasks

– Sometimes still referred to as “embarrassingly parallel”

Page 611: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

11

Topics

• Introduction

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Vector Dot-Product

• Matrix Multiplication

Page 612: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Mandelbrot set

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B.

Wilkinson & M. Allen,

@ 2004 Pearson Education Inc. All rights reserved.

12

Page 613: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson

& M. Allen,

@ 2004 Pearson Education Inc. All rights reserved.

Mandelbrot Set

Set of points in a complex plane that are quasi-stable (will increase and decrease, but not exceed some limit) when computed by iterating the function

z_{k+1} = z_k^2 + c

where z_{k+1} is the (k + 1)th iteration of the complex number z = (a + bi) and c is a complex number giving the position of the point in the complex plane. The initial value for z is zero.

Iterations are continued until the magnitude of z is greater than 2 or the number of iterations reaches an arbitrary limit. The magnitude of z is the length of the vector, given by

|z| = sqrt(a^2 + b^2)

13

Page 614: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,

@ 2004 Pearson Education Inc. All rights reserved.

Sequential routine computing value of

one point returning number of iterations

struct complex {

float real;

float imag;

};

int cal_pixel(complex c)

{

int count, max;

complex z;

float temp, lengthsq;

max = 256;

z.real = 0; z.imag = 0;

count = 0; /* number of iterations */

do {

temp = z.real * z.real - z.imag * z.imag + c.real;

z.imag = 2 * z.real * z.imag + c.imag;

z.real = temp;

lengthsq = z.real * z.real + z.imag * z.imag;

count++;

} while ((lengthsq < 4.0) && (count < max));

return count;

}

14

Page 615: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Parallelizing Mandelbrot Set Computation

Static Task Assignment

Simply divide the region into a fixed number of parts, each computed by a separate processor.

Not very successful, because different regions require different numbers of iterations and time.

Dynamic Task Assignment

Have processors request new regions after computing their previous regions.

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,

@ 2004 Pearson Education Inc. All rights reserved.

15

Page 616: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,

@ 2004 Pearson Education Inc. All rights reserved.

Dynamic Task Assignment: Work Pool / Processor Farms

16
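
The work-pool approach shown in the figure above can be implemented, for example, with the master handing out one row index at a time and giving a worker a new row as soon as it returns a result. The sketch below is an illustration added here, not the lecture's code; the message tags, the buffer layout, and the row-width bound are assumptions made for the example:

#include <mpi.h>

#define TAG_WORK 1
#define TAG_STOP 2
#define MAX_NY   1024   /* assumed upper bound on row width for this sketch */

void master(int nrows, int nworkers, int ny)
{
  double row_buf[MAX_NY + 1];   /* [0] = row index, [1..ny] = computed pixel values */
  int next_row = 0, active = 0;
  MPI_Status st;

  /* prime every worker with one row */
  for (int w = 1; w <= nworkers && next_row < nrows; w++) {
    MPI_Send(&next_row, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
    next_row++; active++;
  }
  while (active > 0) {
    MPI_Recv(row_buf, ny + 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &st);
    active--;
    /* ... store row_buf[1..ny] at row (int)row_buf[0] ... */
    if (next_row < nrows) {   /* more work: immediately hand out another row */
      MPI_Send(&next_row, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
      next_row++; active++;
    } else {                  /* no more work: tell this worker to stop */
      int stop = -1;
      MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
    }
  }
  /* the matching worker loop (receive row index, compute, send back) is omitted */
}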

Page 617: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

17

Flowchart for Mandelbrot Set Generation ("master" and "workers")

[Flowchart: the master and each worker initialize the MPI environment, create a local workload buffer, isolate their work regions, and calculate Mandelbrot set values across their regions. The workers send their results to the master; the master writes its own result (task 0) to a file, receives the results from the workers, concatenates them to the file, and ends.]

Page 618: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

18

Mandelbrot Sets (source code)

#include <stdio.h>

#include<assert.h>

#include<stdlib.h>

#include<mpi.h>

typedef struct complex{

double real;

double imag;

} Complex;

int cal_pixel(Complex c){

int count, max_iter;

Complex z;

double temp, lengthsq;

max_iter = 256;

z.real = 0;

z.imag = 0;

count = 0;

do{

temp = z.real * z.real - z.imag * z.imag + c.real;

z.imag = 2 * z.real * z.imag + c.imag;

z.real = temp;

lengthsq = z.real * z.real + z.imag * z.imag;

count ++;

}

while ((lengthsq < 4.0) && (count < max_iter));

return(count);

} Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

cal_pixel() runs on every worker process and calculates the iteration count of z_{k+1} = z_k^2 + c for every pixel

Page 619: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

19

Mandelbrot Sets (source code)

#define MASTERPE 0

int main(int argc, char **argv)
{
  FILE *file;
  int i, j;
  int tmp;
  Complex c;
  double *data_l, *data_l_tmp;
  int nx, ny;
  int mystrt, myend;
  int nrows_l;
  int nprocs, mype;
  MPI_Status status;

  /***** Initializing MPI Environment *****/
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &mype);

  /***** Pass in the dimension (X,Y) of the area to cover *****/
  if (argc != 3){
    int err = 0;
    printf("argc %d\n", argc);
    if (mype == MASTERPE){
      printf("usage: mandelbrot nx ny");
      MPI_Abort(MPI_COMM_WORLD, err);
    }
  }
  /* get command line args */
  nx = atoi(argv[1]);
  ny = atoi(argv[2]);

Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

Initialize MPI Environment

Check if the input arguments : x,y dimensions of the region to be processed are passed

Page 620: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

20

Mandelbrot Sets (source code)

  /* assume divides equally */
  nrows_l = nx/nprocs;
  mystrt = mype*nrows_l;
  myend = mystrt + nrows_l - 1;

  /* create buffer for local work only */
  data_l = (double *) malloc(nrows_l * ny * sizeof(double));
  data_l_tmp = data_l;

  /* calc each procs coordinates and call local mandelbrot value generation function */
  for (i = mystrt; i <= myend; ++i){
    c.real = i/((double) nx) * 4. - 2.;
    for (j = 0; j < ny; ++j){
      c.imag = j/((double) ny) * 4. - 2.;
      tmp = cal_pixel(c);
      *data_l++ = (double) tmp;
    }
  }
  data_l = data_l_tmp;

Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

Determining the dimensions of the work to be performed by each concurrent task.

Local tasks calculate the coordinates for each pixel in the local region.For each pixel, cal_pixel() function is called and the corresponding value is calculated

Page 621: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

21

Mandelbrot Sets (source code)

  if (mype == MASTERPE){
    file = fopen("mandelbrot.bin_0000", "w");
    printf("nrows_l, ny %d %d\n", nrows_l, ny);
    fwrite(data_l, nrows_l*ny, sizeof(double), file);
    fclose(file);
    for (i = 1; i < nprocs; ++i){
      MPI_Recv(data_l, nrows_l * ny, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
      printf("received message from proc %d\n", i);
      file = fopen("mandelbrot.bin_0000", "a");
      fwrite(data_l, nrows_l*ny, sizeof(double), file);
      fclose(file);
    }
  }
  else {
    MPI_Send(data_l, nrows_l * ny, MPI_DOUBLE, MASTERPE, 0, MPI_COMM_WORLD);
  }
  MPI_Finalize();
}

Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

Master process opens a file to store output into and stores its values in the file

Master then waits to receive values computed by each of the worker processes

Worker processes send computed mandelbrot values of their region to the master process

Page 622: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

22

Demo : Mandelbrot Sets

Page 623: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Demo: Mandelbrot Sets

23

Page 624: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

24

Topics

• Introduction

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Vector Dot-Product

• Matrix Multiplication

Page 625: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

25

Page 626: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Monte Carlo Simulation

• Used when it is infeasible or impossible to compute an exact result with a deterministic algorithm

• Especially useful in

– Studying systems with a large number of coupled degrees of freedom
• Fluids, disordered materials, strongly coupled solids, cellular structures

– Modeling phenomena with significant uncertainty in inputs
• The calculation of risk in business

– These methods are also widely used in mathematics
• The evaluation of definite integrals, particularly multidimensional integrals with complicated boundary conditions

26

Page 627: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Monte Carlo Simulation

• No single approach, multitude of different methods

• Usually follows pattern

– Define a domain of possible inputs

– Generate inputs randomly from the domain

– Perform a deterministic computation using the inputs

– Aggregate the results of the individual computations into the final result

• Example: calculate Pi

27

Page 628: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

28

Monte Carlo: Algorithm for Pi

• The value of PI can be calculated in a number of ways. Consider the following method of approximating PI: inscribe a circle in a square

• Randomly generate points in the square

• Determine the number of points in the square that are also in the circle

• Let r be the number of points in the circle divided by the number of points in the square

• PI ~ 4 r

• Note that the more points generated, the better the approximation

• Algorithm :

npoints = 10000

circle_count = 0

do j = 1,npoints

generate 2 random numbers between 0 and 1

xcoordinate = random1 ; ycoordinate = random2

if (xcoordinate, ycoordinate) inside circle

then circle_count = circle_count + 1

end do

PI = 4.0*circle_count/npoints
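
A minimal serial C version of the pseudocode above (a sketch added here, not from the original slides; it assumes rand() from <stdlib.h> as the random number source):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const int npoints = 10000;
  int circle_count = 0;

  srand(42);                                /* arbitrary seed */
  for (int j = 0; j < npoints; j++) {
    double x = (double)rand() / RAND_MAX;   /* random point in the unit square */
    double y = (double)rand() / RAND_MAX;
    if (x*x + y*y <= 1.0)                   /* point falls inside the inscribed circle */
      circle_count++;
  }
  printf("PI ~ %f\n", 4.0 * circle_count / npoints);
  return 0;
}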

Page 629: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

29

Page 630: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

30

OpenMP Pi Calculation

[Flowchart: the master thread initializes variables and the OpenMP parallel environment; the master and N worker threads each repeatedly generate random X,Y points, calculate Z = X^2 + Y^2, and increment their count when the point lies within the circle; the per-thread counts are combined with a reduction (∑), and the master calculates and prints the value of pi.]

Page 631: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

OpenMP Calculating Pi

31

#include <omp.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#define SEED 42

main(int argc, char* argv[])
{
  int niter=0;
  double x,y;
  int i,tid,count=0;   /* # of points in the 1st quadrant of unit circle */
  double z;
  double pi;
  time_t rawtime;
  struct tm * timeinfo;

  printf("Enter the number of iterations used to estimate pi: ");
  scanf("%d",&niter);
  time ( &rawtime );
  timeinfo = localtime ( &rawtime );

Seed for generating random number

http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML

Page 632: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

OpenMP Calculating Pi

32

printf ( "The current date/time is: %s", asctime (timeinfo) );/* initialize random numbers */srand(SEED);

#pragma omp parallel for private(x,y,z,tid) reduction(+:count)for ( i=0; i<niter; i++) {

x = (double)rand()/RAND_MAX;y = (double)rand()/RAND_MAX;z = (x*x+y*y);if (z<=1) count++;if (i==(niter/6)-1) {

tid = omp_get_thread_num();printf(" thread %i just did iteration %i the count is %i\n",tid,i,count);

}if (i==(niter/3)-1) {

tid = omp_get_thread_num();printf(" thread %i just did iteration %i the count is %i\n",tid,i,count);

}if (i==(niter/2)-1) {

tid = omp_get_thread_num();printf(" thread %i just did iteration %i the count is %i\n",tid,i,count);

} http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML

Initialize random number generator; srand is used to seed the random number generated by rand()

Randomly generate x,y points

Initialize OpenMP parallel for with reduction(∑)

Calculate x^2+y^2 and check if it lies within the circle; if yes then increment count

Page 633: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Calculating Pi

33

    if (i==(2*niter/3)-1) {
      tid = omp_get_thread_num();
      printf(" thread %i just did iteration %i the count is %i\n",tid,i,count);
    }
    if (i==(5*niter/6)-1) {
      tid = omp_get_thread_num();
      printf(" thread %i just did iteration %i the count is %i\n",tid,i,count);
    }
    if (i==niter-1) {
      tid = omp_get_thread_num();
      printf(" thread %i just did iteration %i the count is %i\n",tid,i,count);
    }
  }
  time ( &rawtime );
  timeinfo = localtime ( &rawtime );
  printf ( "The current date/time is: %s", asctime (timeinfo) );
  printf(" the total count is %i\n",count);
  pi=(double)count/niter*4;
  printf("# of trials= %d , estimate of pi is %g \n",niter,pi);
  return 0;
}

http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML

Calculate PI based on the aggregate count of the points that lie within the circle

Page 634: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Demo : OpenMP Pi

34

[cdekate@celeritas l13]$ ./omcpi
Enter the number of iterations used to estimate pi: 100000
The current date/time is: Tue Mar 4 05:53:52 2008
thread 0 just did iteration 16665 the count is 13124
thread 1 just did iteration 33332 the count is 6514
thread 1 just did iteration 49999 the count is 19609
thread 2 just did iteration 66665 the count is 13048
thread 3 just did iteration 83332 the count is 6445
thread 3 just did iteration 99999 the count is 19489
The current date/time is: Tue Mar 4 05:53:52 2008
the total count is 78320
# of trials= 100000 , estimate of pi is 3.1328
[cdekate@celeritas l13]$

Page 635: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

35

Creating Custom Communicators

• Communicators define groups and the access patterns among them

• Default communicator is MPI_COMM_WORLD

• Some algorithms demand more sophisticated control of communications to take advantage of reduction operators

• MPI permits creation of custom communicators

• MPI_Comm_create
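
For example, a communicator containing every rank except a designated server can be built from the group of MPI_COMM_WORLD. The fragment below is a sketch of the calls used in the Monte Carlo example later in this lecture; server_rank is a placeholder name:

MPI_Group world_group, worker_group;
MPI_Comm  workers;
int exclude[1];

MPI_Comm_group(MPI_COMM_WORLD, &world_group);        /* group of all ranks */
exclude[0] = server_rank;                            /* rank to leave out (the server) */
MPI_Group_excl(world_group, 1, exclude, &worker_group);
MPI_Comm_create(MPI_COMM_WORLD, worker_group, &workers);
MPI_Group_free(&worker_group);
/* collectives such as MPI_Allreduce can now run over "workers" only */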

Page 636: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

36

MPI Monte Carlo Pi Computation

[Flowchart: the server process initializes MPI, then repeatedly receives a request, computes a random array, and sends it to the requestor until the last request arrives, after which it finalizes MPI. The master and the workers initialize MPI; the master broadcasts the error bound and the workers receive it; each then repeatedly sends a request to the server, receives a random array, performs its computations, and propagates the number of points with an Allreduce until the stop condition is satisfied. The master outputs partial results and prints statistics; all processes finalize MPI.]

Page 637: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

37

Monte Carlo : MPI - Pi (source code)

#include <stdio.h>
#include <math.h>
#include "mpi.h"
#define CHUNKSIZE 1000
#define INT_MAX 1000000000
#define REQUEST 1
#define REPLY 2

int main( int argc, char *argv[] )
{
  int iter;
  int in, out, i, iters, max, ix, iy, ranks[1], done, temp;
  double x, y, Pi, error, epsilon;
  int numprocs, myid, server, totalin, totalout, workerid;
  int rands[CHUNKSIZE], request;
  MPI_Comm world, workers;
  MPI_Group world_group, worker_group;
  MPI_Status status;

  MPI_Init(&argc,&argv);
  world = MPI_COMM_WORLD;
  MPI_Comm_size(world,&numprocs);
  MPI_Comm_rank(world,&myid);

Initialize MPI environment

Page 638: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

38

Monte Carlo : MPI - Pi (source code)

  server = numprocs-1;   /* last proc is server */
  if (myid == 0)
    sscanf( argv[1], "%lf", &epsilon );

  MPI_Bcast( &epsilon, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD );
  MPI_Comm_group( world, &world_group );
  ranks[0] = server;
  MPI_Group_excl( world_group, 1, ranks, &worker_group );
  MPI_Comm_create( world, worker_group, &workers );
  MPI_Group_free(&worker_group);

  if (myid == server) {
    do {
      MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, REQUEST, world, &status);
      if (request) {
        for (i = 0; i < CHUNKSIZE; ) {
          rands[i] = random();
          if (rands[i] <= INT_MAX) i++;
        }
        /* Send random number array */
        MPI_Send(rands, CHUNKSIZE, MPI_INT, status.MPI_SOURCE, REPLY, world);
      }
    } while( request>0 );
  }
  else {   /* Begin Worker Block */
    request = 1;
    done = in = out = 0;
    max = INT_MAX;   /* max int, for normalization */
    MPI_Send( &request, 1, MPI_INT, server, REQUEST, world );
    MPI_Comm_rank( workers, &workerid );
    iter = 0;

Broadcast Error Bounds: epsilon

Create a custom communicator

Server process : 1. Receives a request to generate a random number array, 2. Computes the random number array, 3. Sends the array to the requestor

Worker process : Request the server to generate a random number array

Page 639: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

39

Monte Carlo : MPI - Pi (source code)

    while (!done) {
      iter++;
      request = 1;
      /* Recv. random array from server */
      MPI_Recv( rands, CHUNKSIZE, MPI_INT, server, REPLY, world, &status );
      for (i=0; i<CHUNKSIZE-1; ) {
        x = (((double) rands[i++])/max) * 2 - 1;
        y = (((double) rands[i++])/max) * 2 - 1;
        if (x*x + y*y < 1.0) in++;
        else out++;
      }
      MPI_Allreduce(&in, &totalin, 1, MPI_INT, MPI_SUM, workers);
      MPI_Allreduce(&out, &totalout, 1, MPI_INT, MPI_SUM, workers);
      Pi = (4.0*totalin)/(totalin + totalout);
      error = fabs( Pi-3.141592653589793238462643);
      done = (error < epsilon || (totalin+totalout) > 1000000);
      request = (done) ? 0 : 1;
      if (myid == 0) {   /* If "Master" : Print current value of PI */
        printf( "\rpi = %23.20f", Pi );
        MPI_Send( &request, 1, MPI_INT, server, REQUEST, world );
      }
      else {   /* If "Worker" : Request new array if not finished */
        if (request)
          MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
      }
    }
    MPI_Comm_free(&workers);
  }

Worker : Receive random number array from the Server

Worker: For each pair of x,y in the random number array, calculate the coordinates

Determine if the number is inside or out of the circle

Print current value of PI and request for more work

Compute the value of pi and check whether the error is within the threshold

Page 640: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

40

Monte Carlo : MPI - Pi (source code)

  if (myid == 0) {   /* If "Master" : Print Results */
    printf( "\npoints: %d\nin: %d, out: %d, <ret> to exit\n",
            totalin+totalout, totalin, totalout );
    getchar();
  }
  MPI_Finalize();
}

Print the final value of PI

Page 641: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

41

Demo : MPI Monte Carlo, Pi

> mpirun -np 4 monte 1e-20
pi = 3.14164517741129456496
points: 1000500
in: 785804, out: 214696

Page 642: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

42

Topics

• Introduction

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Vector Dot-Product

• Matrix Multiplication

Page 643: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Vector Dot Product

• Multiplication of 2 vectors followed by Summation

43

A[i] = (X1, X2, X3, X4, X5, ..., Xn)
B[i] = (Y1, Y2, Y3, Y4, Y5, ..., Yn)

A · B = ∑ (i = 1..n) A[i] * B[i] = X1*Y1 + X2*Y2 + X3*Y3 + X4*Y4 + X5*Y5 + ... + Xn*Yn

Page 644: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

44

OpenMP Dot Product : using Reduction

[Flowchart: the master thread initializes variables and the OpenMP parallel environment; the master and N worker threads each calculate their local computations, which are combined by a reduction (∑); the master prints the value of the dot product. The workload and schedule are determined by OpenMP at runtime.]

Page 645: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

OpenMP Dot Product

45

#include <omp.h>
#include <stdio.h>   /* added: required for printf */

main () {
  int i, n, chunk;
  float a[16], b[16], result;
  n = 16;
  chunk = 4;
  result = 0.0;
  for (i=0; i < n; i++) {
    a[i] = i * 1.0;
    b[i] = i * 2.0;
  }
  #pragma omp parallel for default(shared) private(i) \
      schedule(static,chunk) reduction(+:result)
  for (i=0; i < n; i++)
    result = result + (a[i] * b[i]);
  printf("Final result= %f\n", result);
}

Reduction example with summation, where the result of the reduction operation stores the dot product of two vectors

∑ a[i]*b[i]

SRC : https://computing.llnl.gov/tutorials/openMP/

Page 646: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Demo: Dot Product using Reduction

46

[cdekate@celeritas l12]$ ./reduction
 a[i]      b[i]       a[i]*b[i]
 0.000000  0.000000   0.000000
 1.000000  2.000000   2.000000
 2.000000  4.000000   8.000000
 3.000000  6.000000   18.000000
 4.000000  8.000000   32.000000
 5.000000  10.000000  50.000000
 6.000000  12.000000  72.000000
 7.000000  14.000000  98.000000
 8.000000  16.000000  128.000000
 9.000000  18.000000  162.000000
10.000000  20.000000  200.000000
11.000000  22.000000  242.000000
12.000000  24.000000  288.000000
13.000000  26.000000  338.000000
14.000000  28.000000  392.000000
15.000000  30.000000  450.000000
Final result= 2480.000000
[cdekate@celeritas l12]$

Page 647: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

47

MPI Dot Product Computation

[Flowchart: the master initializes variables and the MPI environment, broadcasts the size of the vectors, reads vector A and distributes its partitions, then reads vector B and distributes its partitions; each worker initializes variables and the MPI environment, receives the size of the vectors, and receives its local workloads for vectors A and B. The master and workers calculate the dot product for their local workloads, the partial results are combined by a reduction (∑), and the master prints the result.]

Page 648: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

MPI Dot Product

48

#include <stdio.h>#include "mpi.h"#define MAX_LOCAL_ORDER 100main(int argc, char* argv[]) {

float local_x[MAX_LOCAL_ORDER];float local_y[MAX_LOCAL_ORDER];int n;int n_bar; /* = n/p */float dot;int p;int my_rank;void Read_vector(char* prompt, float local_v[], int n_bar, int p,

int my_rank);float Parallel_dot(float local_x[], float local_y[], int n_bar);

MPI_Init(&argc, &argv);MPI_Comm_size(MPI_COMM_WORLD, &p);MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);if (my_rank == 0) {

printf("Enter the order of the vectors\n");scanf("%d", &n);

}

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

Initialize MPI Environment

Broadcast the order of vectors across the workers

Parallel Programming with MPI by Peter Pacheco

Page 649: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

MPI Dot Product

49

n_bar = n/p;

Read_vector("the first vector", local_x, n_bar, p, my_rank);Read_vector("the second vector", local_y, n_bar, p, my_rank);

dot = Parallel_dot(local_x, local_y, n_bar);

if (my_rank == 0)printf("The dot product is %f\n", dot);

MPI_Finalize();} /* main */

void Read_vector(char* prompt /* in */,float local_v[] /* out */,int n_bar /* in */,int p /* in */,int my_rank /* in */) {

int i, q;

Receive and distribute the two vectors

Calculate the parallel dot product for local workloads

Master: Print the result of the dot product

Parallel Programming with MPI by Peter Pacheco

Page 650: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

MPI Dot Product

50

  float temp[MAX_LOCAL_ORDER];
  MPI_Status status;

  if (my_rank == 0) {
    printf("Enter %s\n", prompt);
    for (i = 0; i < n_bar; i++)
      scanf("%f", &local_v[i]);
    for (q = 1; q < p; q++) {
      for (i = 0; i < n_bar; i++)
        scanf("%f", &temp[i]);
      MPI_Send(temp, n_bar, MPI_FLOAT, q, 0, MPI_COMM_WORLD);
    }
  } else {
    MPI_Recv(local_v, n_bar, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
  }
}  /* Read_vector */

float Serial_dot(float x[] /* in */,

MASTER: Get the input from the user and prepare the local workload

Get the input from the user, balance the load by storing the work chunks in an array and sending the array to the worker nodes for processing

Worker : Receive the local workload to be processed

Serial_dot() : calculates the dot product on local arrays

Parallel Programming with MPI by Peter Pacheco

Page 651: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

MPI Dot Product

51

                 float y[] /* in */,
                 int n     /* in */) {
  int i;
  float sum = 0.0;
  for (i = 0; i < n; i++)
    sum = sum + x[i]*y[i];
  return sum;
}  /* Serial_dot */

float Parallel_dot(float local_x[] /* in */,
                   float local_y[] /* in */,
                   int n_bar       /* in */) {
  float local_dot;
  float dot = 0.0;

  local_dot = Serial_dot(local_x, local_y, n_bar);
  MPI_Reduce(&local_dot, &dot, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
  return dot;
}  /* Parallel_dot */

Serial_dot() : calculates the dot product on local arrays

Parallel_dot() : Calls the Serial_dot() to perform the dot product for local workload

Calculate the dot product and compute the summation using the collective MPI_Reduce call (MPI_SUM)

Parallel Programming with MPI by Peter Pacheco

Page 652: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Demo: MPI Dot Product

52

[cdekate@celeritas l13]$ mpirun …. ./mpi_dot
Enter the order of the vectors
16
Enter the first vector
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Enter the second vector
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
The dot product is 2480.000000
[cdekate@celeritas l13]$

Page 653: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

53

Topics

• Introduction

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Vector Dot-Product

• Matrix Multiplication

Page 654: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

54

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.

Matrix Vector Multiplication

Page 655: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

55

Matrix-Vector Multiplication: c = A x b

Page 656: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

56

Implementing Matrix Multiplication

Sequential Code

Assume throughout that the matrices are square (n x n matrices). The sequential code to compute A x B could simply be

for (i = 0; i < n; i++)
  for (j = 0; j < n; j++) {
    c[i][j] = 0;
    for (k = 0; k < n; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
  }

This algorithm requires n^3 multiplications and n^3 additions, leading to a sequential time complexity of O(n^3). Very easy to parallelize.

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B.

Wilkinson & M. Allen,

@ 2004 Pearson Education Inc. All rights reserved.

Page 657: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

Implementing Matrix Multiplication

• With n processors (and n x n matrices), we can obtain:

• Time complexity of O(n^2) with n processors
– Each instance of the inner loop is independent and can be done by a separate processor

• Time complexity of O(n) with n^2 processors
– One element of A and B assigned to each processor.
– Cost optimal since O(n^3) = n x O(n^2) = n^2 x O(n).

• Time complexity of O(log n) with n^3 processors
– By parallelizing the inner loop.
– Not cost-optimal since O(n^3) < n^3 x O(log n).

• O(log n) is a lower bound for parallel matrix multiplication.

57

Page 658: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

58

Block Matrix Multiplication

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B.

Wilkinson & M. Allen,

@ 2004 Pearson Education Inc. All rights reserved.

Partitioning into sub-matrices

Page 659: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

59

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B.

Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.

Matrix Multiplication

Page 660: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

60

Performance Improvement

Using a tree construction, n numbers can be added in O(log n) steps (using n^3 processors):

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen, @ 2004 Pearson Education Inc. All rights reserved.
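
As an illustration of the tree idea (a sketch added here, not the lecture's code), pairwise summation halves the number of partial sums at every level, so n values are combined in about log2(n) steps; on a parallel machine the additions within one level are independent:

#include <stdio.h>

/* Pairwise (tree) summation: one loop iteration per tree level. */
double tree_sum(double *x, int n)
{
  for (int stride = 1; stride < n; stride *= 2)     /* log2(n) levels */
    for (int i = 0; i + stride < n; i += 2 * stride)
      x[i] += x[i + stride];                        /* independent additions at this level */
  return x[0];
}

int main(void)
{
  double v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
  printf("sum = %f\n", tree_sum(v, 8));             /* prints 36.000000 */
  return 0;
}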

Page 661: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

61

OpenMP: Flowchart for Matrix Multiplication

[Flowchart: initialize variables and matrices, initialize the OpenMP environment, have each thread compute the matrix product for its local workload, and print the results. The schedule and workload chunk size are determined from user preferences at compile/run time; since each thread works on its own portion of the array and updates different parts of the same array, no synchronization is needed.]

Page 662: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

OpenMP Matrix Multiplication

62

#include <stdio.h>
#include <omp.h>

/* Main Program */
main()
{
  int NoofRows_A, NoofCols_A, NoofRows_B, NoofCols_B, i, j, k;
  NoofRows_A = NoofCols_A = NoofRows_B = NoofCols_B = 4;
  float Matrix_A[NoofRows_A][NoofCols_A];
  float Matrix_B[NoofRows_B][NoofCols_B];
  float Result[NoofRows_A][NoofCols_B];

  /* Matrix_A Elements */
  for (i = 0; i < NoofRows_A; i++) {
    for (j = 0; j < NoofCols_A; j++)
      Matrix_A[i][j] = i + j;
  }
  /* Matrix_B Elements */
  for (i = 0; i < NoofRows_B; i++) {
    for (j = 0; j < NoofCols_B; j++)
      Matrix_B[i][j] = i + j;
  }
  printf("The Matrix_A Is \n");

Initialize the two Matrices A[][] & B[][] with sum of their index values

SRC : https://computing.llnl.gov/tutorials/openMP/

Page 663: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

OpenMP Matrix Multiplication

63

  for (i = 0; i < NoofRows_A; i++) {
    for (j = 0; j < NoofCols_A; j++)
      printf("%f \t", Matrix_A[i][j]);
    printf("\n");
  }
  printf("The Matrix_B Is \n");
  for (i = 0; i < NoofRows_B; i++) {
    for (j = 0; j < NoofCols_B; j++)
      printf("%f \t", Matrix_B[i][j]);
    printf("\n");
  }

  for (i = 0; i < NoofRows_A; i++) {
    for (j = 0; j < NoofCols_B; j++) {
      Result[i][j] = 0.0;
    }
  }

  #pragma omp parallel for private(j,k)
  for (i = 0; i < NoofRows_A; i = i + 1)
    for (j = 0; j < NoofCols_B; j = j + 1)
      for (k = 0; k < NoofCols_A; k = k + 1)
        Result[i][j] = Result[i][j] + Matrix_A[i][k] * Matrix_B[k][j];

  printf("\nThe Matrix Computation Result Is \n");

Initialize the results matrix with 0.0

Print the Matrices for debugging purposes

Using the OpenMP parallel for directive, calculate the product of the two matrices. Load balancing is based on the values of the OpenMP environment variables and the number of threads

SRC : https://computing.llnl.gov/tutorials/openMP/

Page 664: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

OpenMP Matrix Multiplication

64

  for (i = 0; i < NoofRows_A; i = i + 1) {
    for (j = 0; j < NoofCols_B; j = j + 1)
      printf("%f ", Result[i][j]);
    printf("\n");
  }
}

SRC : https://computing.llnl.gov/tutorials/openMP/

Page 665: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

DEMO : OpenMP Matrix Multiplication

65

[cdekate@celeritas l13]$ ./omp_mm
The Matrix_A Is
0.000000 1.000000 2.000000 3.000000
1.000000 2.000000 3.000000 4.000000
2.000000 3.000000 4.000000 5.000000
3.000000 4.000000 5.000000 6.000000
The Matrix_B Is
0.000000 1.000000 2.000000 3.000000
1.000000 2.000000 3.000000 4.000000
2.000000 3.000000 4.000000 5.000000
3.000000 4.000000 5.000000 6.000000

The Matrix Computation Result Is
14.000000 20.000000 26.000000 32.000000
20.000000 30.000000 40.000000 50.000000
26.000000 40.000000 54.000000 68.000000
32.000000 50.000000 68.000000 86.000000
[cdekate@celeritas l13]$

Page 666: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

66

Flowchart for MPI Matrix Multiplication ("master" and "workers")

[Flowchart: the master and the workers each initialize the MPI environment; the master initializes the array, partitions it into workloads, and sends a workload to each worker; the workers receive their work, calculate their portion of the matrix product, and send the result back; the master waits for the workers to finish, receives the results, prints them, and ends.]

Page 667: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

67

Matrix Multiplication (source code)#include "mpi.h"#include <stdio.h>#include <stdlib.h>#define NRA 4 /* number of rows in matrix A */#define NCA 4 /* number of columns in matrix A */#define NCB 4 /* number of columns in matrix B */#define MASTER 0 /* taskid of first task */#define FROM_MASTER 1 /* setting a message type */#define FROM_WORKER 2 /* setting a message type */int main(argc,argv)int argc;char *argv[];{int numtasks, /* number of tasks in partition */

taskid, /* a task identifier */numworkers, /* number of worker tasks */source, /* task id of message source */dest, /* task id of message destination */mtype, /* message type */rows, /* rows of matrix A sent to each worker */averow, extra, offset, /* used to determine rows sent to each worker */i, j, k, rc; /* misc */

double a[NRA][NCA], /* matrix A to be multiplied */b[NCA][NCB], /* matrix B to be multiplied */c[NRA][NCB]; /* result matrix C */

MPI_Status status;

MPI_Init(&argc,&argv);MPI_Comm_rank(MPI_COMM_WORLD,&taskid);MPI_Comm_size(MPI_COMM_WORLD,&numtasks);

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

Initialize the MPI environment


Page 668: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

68

Matrix Multiplication (source code)

if (numtasks < 2 ) {
  printf("Need at least two MPI tasks. Quitting...\n");
  MPI_Abort(MPI_COMM_WORLD, rc);
  exit(1);
}
numworkers = numtasks-1;

if (taskid == MASTER)
{
  for (i=0; i<NRA; i++)
    for (j=0; j<NCA; j++){
      a[i][j]= i+j+1;
      b[i][j]= i+j+1;
    }
  printf("Matrix A :: \n");
  for (i=0; i<NRA; i++){
    printf("\n");
    for (j=0; j<NCB; j++)
      printf("%6.2f ", a[i][j]);
  }
  printf("Matrix B :: \n");
  for (i=0; i<NRA; i++) {
    printf("\n");
    for (j=0; j<NCB; j++)
      printf("%6.2f ", b[i][j]);
  }
  averow = NRA/numworkers;
  extra = NRA%numworkers;
  offset = 0;
  mtype = FROM_MASTER;

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

MASTER: Initialize the matrix A & B

Print the two matrices for Debugging purposes

Calculate the number of rows to be processed by each worker

Calculate the number of overflow rows to be processed additionally by each worker

Page 669: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

69

Matrix Multiplication (source code)

  for (dest=1; dest<=numworkers; dest++) {
    /* To each worker send: start point, number of rows to process, and sub-arrays to process */
    rows = (dest <= extra) ? averow+1 : averow;
    printf("Sending %d rows to task %d offset=%d\n",rows,dest,offset);
    MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
    MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
    MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
    MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
    offset = offset + rows;
  }

  /* Receive results from worker tasks */
  mtype = FROM_WORKER;   /* Message tag for messages sent by "workers" */
  for (i=1; i<=numworkers; i++){
    source = i;
    /* offset stores the (processing) starting point of the work chunk */
    MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
    MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
    MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
    printf("Received results from task %d\n",source);
  }
  printf("******************************************************\n");
  printf("Result Matrix:\n");
  for (i=0; i<NRA; i++){
    printf("\n");
    for (j=0; j<NCB; j++)
      printf("%6.2f ", c[i][j]);
  }
  printf("\n******************************************************\n");
  printf ("Done.\n");
}

MASTER: Send the workload chunk across to each of the workers

MASTER: Receive the workload chunks back from the workers; c[][] contains the matrix products calculated for each workload chunk by the corresponding worker

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

Page 670: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

70

Matrix Multiplication (source code)

/**************************** worker task ************************************/
if (taskid > MASTER)
{
  mtype = FROM_MASTER;
  MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
  MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
  MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
  MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

  for (k=0; k<NCB; k++)
    for (i=0; i<rows; i++){
      c[i][k] = 0.0;
      for (j=0; j<NCA; j++)
        /* Calculate the product and store result in C */
        c[i][k] = c[i][k] + a[i][j] * b[j][k];
    }
  mtype = FROM_WORKER;
  MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
  MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
  /* Worker sends the resultant array to the master */
  MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
}
MPI_Finalize();
}

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

WORKER: Receive the workload to be processed by each worker

Calculate the matrix product and store the result in c[][]

Send the computed results array to the Master


Page 671: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

71

Demo : Matrix Multiplication

[cdekate@celeritas matrix_multiplication]$ mpirun -np 4 -machinefile ~/hosts ./mpi_mm
mpi_mm has started with 4 tasks.
Initializing arrays...
Matrix A ::
  1.00   2.00   3.00   4.00
  2.00   3.00   4.00   5.00
  3.00   4.00   5.00   6.00
  4.00   5.00   6.00   7.00
Matrix B ::
  1.00   2.00   3.00   4.00
  2.00   3.00   4.00   5.00
  3.00   4.00   5.00   6.00
  4.00   5.00   6.00   7.00
Sending 2 rows to task 1 offset=0
Sending 1 rows to task 2 offset=2
Sending 1 rows to task 3 offset=3
Received results from task 1
Received results from task 2
Received results from task 3
Result Matrix:
 30.00  40.00  50.00  60.00
 40.00  54.00  68.00  82.00
 50.00  68.00  86.00 104.00
 60.00  82.00 104.00 126.00
[cdekate@celeritas matrix_multiplication]$

Page 672: Paralel Computing

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

72

Page 673: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &

MEANS

APPLIED PARALLEL ALGORITHMS 2

Prof. Thomas Sterling
Dr. Hartmut Kaiser
Department of Computer Science
Louisiana State University
March 18, 2011

Page 674: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Puzzle of the Day

• Some nice ways to get something different from what

was intended:

2

if(a = 0) { … }       /* a always equals 0, but block will never be executed */

if(0 < a < 5) { … }   /* this "boolean" is always true! [think: (0 < a) < 5] */

if(a =! 0) { … }      /* a always equal to 1, as this is compiled as (a = !0), an assignment, rather than (a != 0) or (a == !0) */
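
For contrast, the comparisons that were presumably intended (a short sketch added here, not part of the slide):

if (a == 0)         { /* ... */ }   /* equality test, not assignment */
if (0 < a && a < 5) { /* ... */ }   /* range test spelled out explicitly */
if (a != 0)         { /* ... */ }   /* inequality test */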

Page 675: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition

• Matrix Transpose

• Gauss-Jordan Elimination

• LU Decomposition

• Summary Materials for Test

3

Page 676: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition

• Matrix Transpose

• Gauss-Jordan Elimination

• LU Decomposition

• Summary Materials for Test

4

Page 677: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

5

Parallel Matrix Processing & Locality

• Maximize locality

– Spatial locality
• Variable likely to be used if neighbor data is used
• Exploits unit or uniform stride access patterns
• Exploits cache line length
• Adjacent blocks minimize message traffic (depends on volume to surface ratio)

– Temporal locality
• Variable likely to be reused if already recently used
• Exploits cache loads and LRU (least recently used) replacement policy
• Exploits register allocation

– Granularity
• Maximizes length of local computation
• Reduces number of messages
• Maximizes length of individual messages

Page 678: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

6

Array Decomposition

• Simple MPI Example

• Master-Worker Data Partitioning and Distribution
– Array decomposition
– Uniformly distributes parts of the array among workers (and master)
– A kind of static load balancing: assumes equal work on equal data set sizes

• Demonstrates
– Data partitioning
– Data distribution
– Coarse grain parallel execution (no communication between tasks)
– Reduction operator
– Master-worker control model

Page 679: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

7

Array Decomposition Layout

• Dimensions – 1 dimension: linear (dot product)

– 2 dimensions: “2-D” or (matrix operations)

– 3 dimensions (higher order models)

– Impacts surface to volume ratio for inter process communications

• Distribution – Block

• Minimizes messaging

• Maximizes message size

– Cyclic

• Improves load balancing

• Memory layout– C vs. FORTRAN
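
A sketch of the two distribution schemes for a 1-D array of n elements over p processes (illustrative formulas added here, not from the slides): a block distribution gives each rank one contiguous chunk, while a cyclic distribution deals elements out round-robin.

/* Owner of global index i under each distribution (assumes n divisible by p). */
int block_owner(int i, int n, int p)  { return i / (n / p); }
int cyclic_owner(int i, int p)        { return i % p; }

/* Example with n = 8, p = 2:
   block : owners 0 0 0 0 1 1 1 1
   cyclic: owners 0 1 0 1 0 1 0 1 */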

Page 680: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

8

Array Decomposition

[Figure: the complete array is partitioned; a sum is accumulated from each part.]

Page 681: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

9

Array Decomposition

Demonstrate simple data decomposition:

– The master initializes the array and then distributes an equal portion of the array among the other tasks.
– When the other tasks receive their portion of the array, they perform an addition operation on each array element.
– Each task maintains the sum for its portion of the array.
– The master task does likewise with its portion of the array.
– As each of the non-master tasks finishes, it sends its updated portion of the array to the master.
– An MPI collective communication call is used to collect the sums maintained by each task.
– Finally, the master task displays selected parts of the final array and the global sum of all array elements.
– Assumption: the array can be equally divided among the group.

Page 682: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

10

Flowchart for Array Decomposition ("master" and "workers")

[Flowchart: the master and the workers each initialize the MPI environment; the master initializes the array, partitions it into workloads, and sends a workload to each worker; every task (master and workers) calculates the sum for its array chunk; the workers send their sums back, the master receives the results, a reduction operator sums up the results, and the master prints them before ending.]

Page 683: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

11

Array Decompositon (source code)#include "mpi.h"#include <stdio.h>#include <stdlib.h>#define ARRAYSIZE 16000000#define MASTER 0

float data[ARRAYSIZE];int main (int argc, char **argv){int numtasks, taskid, rc, dest, offset, i, j, tag1,

tag2, source, chunksize; float mysum, sum;float update(int myoffset, int chunk, int myid);

MPI_Status status;

MPI_Init(&argc, &argv);MPI_Comm_size(MPI_COMM_WORLD, &numtasks);if (numtasks % 4 != 0) {

printf("Quitting. Number of MPI tasks must be divisible by 4.\n"); /**For equal distribution of workload**/MPI_Abort(MPI_COMM_WORLD, rc);exit(0);}

MPI_Comm_rank(MPI_COMM_WORLD,&taskid);printf ("MPI task %d has started...\n", taskid);

chunksize = (ARRAYSIZE / numtasks);tag2 = 1;tag1 = 2;

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c

Workload to be processed by each processor

Page 684: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

12

Array Decomposition (source code)

if (taskid == MASTER){
  sum = 0;
  for(i=0; i<ARRAYSIZE; i++) {
    data[i] = i * 1.0;
    sum = sum + data[i];
  }
  printf("Initialized array sum = %e\n",sum);

  offset = chunksize;
  for (dest=1; dest<numtasks; dest++) {
    MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
    MPI_Send(&data[offset], chunksize, MPI_FLOAT, dest, tag2, MPI_COMM_WORLD);
    printf("Sent %d elements to task %d offset= %d\n",chunksize,dest,offset);
    offset = offset + chunksize;
  }

  offset = 0;
  mysum = update(offset, chunksize, taskid);

  for (i=1; i<numtasks; i++) {
    source = i;
    MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
    MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, MPI_COMM_WORLD, &status);
  }

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c

Initialize array

Array[0] -> Array[offset-1] is processed by the master

Send workloads to the respective processors

Master computes its local sum

Master receives the updated array portions computed by the workers

Page 685: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

13

Array Decomposition (source code)

  MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);

  printf("Sample results: \n");
  offset = 0;
  for (i=0; i<numtasks; i++) {
    for (j=0; j<5; j++)
      printf(" %e",data[offset+j]);
    printf("\n");
    offset = offset + chunksize;
  }
  printf("*** Final sum= %e ***\n",sum);
}   /* end of master section */

if (taskid > MASTER) {
  /* Receive my portion of array from the master task */
  source = MASTER;
  MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
  MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, MPI_COMM_WORLD, &status);
  mysum = update(offset, chunksize, taskid);
  /* Send my results back to the master task */
  dest = MASTER;
  MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
  MPI_Send(&data[offset], chunksize, MPI_FLOAT, MASTER, tag2, MPI_COMM_WORLD);
  MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
}   /* end of non-master */

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c

Master computes the SUM of all workloads

Worker processes receive work chunks from master

Each worker computes local sum

Send local sum to master process

Page 686: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

14

Array Decomposition (source code)

MPI_Finalize();

} /* end of main */

float update(int myoffset, int chunk, int myid) {
  int i;
  float mysum;
  /* Perform addition to each of my array elements and keep my sum */
  mysum = 0;
  for(i=myoffset; i < myoffset + chunk; i++) {
    data[i] = data[i] + i * 1.0;
    mysum = mysum + data[i];
  }
  printf("Task %d mysum = %e\n",myid,mysum);
  return(mysum);
}

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c

Page 687: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

15

Demo : Array Decomposition

[lsu00@master array_decomposition]$ mpiexec -np 4 ./array

MPI task 0 has started...

MPI task 2 has started...

MPI task 1 has started...

MPI task 3 has started...

Initialized array sum = 1.335708e+14

Sent 4000000 elements to task 1 offset= 4000000

Sent 4000000 elements to task 2 offset= 8000000

Task 1 mysum = 4.884048e+13

Sent 4000000 elements to task 3 offset= 12000000

Task 2 mysum = 7.983003e+13

Task 0 mysum = 1.598859e+13

Task 3 mysum = 1.161867e+14

Sample results:

0.000000e+00 2.000000e+00 4.000000e+00 6.000000e+00 8.000000e+00

8.000000e+06 8.000002e+06 8.000004e+06 8.000006e+06 8.000008e+06

1.600000e+07 1.600000e+07 1.600000e+07 1.600001e+07 1.600001e+07

2.400000e+07 2.400000e+07 2.400000e+07 2.400001e+07 2.400001e+07

*** Final sum= 2.608458e+14 ***

Output from arete for a 4 processor run.

Page 688: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition

• Matrix Transpose

• Gauss-Jordan Elimination

• LU Decomposition

• Summary Materials for Test

16

Page 689: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose

• The transpose of the (m x n) matrix A is the (n x m) matrix formed by interchanging the rows and columns, such that row i becomes column i of the transposed matrix

A = [ a11  a12  ...  a1n ]
    [ a21  a22  ...  a2n ]
    [ ...                ]
    [ am1  am2  ...  amn ]

A^T = [ a11  a21  ...  am1 ]
      [ a12  a22  ...  am2 ]
      [ ...                ]
      [ a1n  a2n  ...  amn ]

Examples:

A = [ 1  3  4 ]        A^T = [ 1  0 ]
    [ 0  1  0 ]              [ 3  1 ]
                             [ 4  0 ]

A = [ 1  3 ]           A^T = [ 1  2 ]
    [ 2  5 ]                 [ 3  5 ]

17

Page 690: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose - OpenMP

18

#include <stdio.h>
#include <sys/time.h>
#include <omp.h>
#define SIZE 4

main()
{
  int i, j;
  float Matrix[SIZE][SIZE], Trans[SIZE][SIZE];

  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      Matrix[i][j] = (i * j) * 5 + i;
  }
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      Trans[i][j] = 0.0;
  }

Initialize source matrix

Initialize results matrix

Page 691: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose - OpenMP

19

  #pragma omp parallel for private(j)
  for (i = 0; i < SIZE; i++)
    for (j = 0; j < SIZE; j++)
      Trans[j][i] = Matrix[i][j];

  printf("The Input Matrix Is \n");
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      printf("%f \t", Matrix[i][j]);
    printf("\n");
  }
  printf("\nThe Transpose Matrix Is \n");
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      printf("%f \t", Trans[i][j]);
    printf("\n");
  }
  return 0;
}

Perform transpose in parallel using omp parallel for

Page 692: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose – OpenMP (DEMO)

20

[LSU760000@n01 matrix_transpose]$ ./omp_mtrans

The Input Matrix Is
0.000000    0.000000    0.000000    0.000000
1.000000    6.000000    11.000000   16.000000
2.000000    12.000000   22.000000   32.000000
3.000000    18.000000   33.000000   48.000000

The Transpose Matrix Is
0.000000    1.000000    2.000000    3.000000
0.000000    6.000000    12.000000   18.000000
0.000000    11.000000   22.000000   33.000000
0.000000    16.000000   32.000000   48.000000

Page 693: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose - MPI

21

#include <stdio.h>#include "mpi.h"#define N 4int A[N][N];void fill_matrix(){int i,j;for(i = 0; i < N; i ++)

for(j = 0; j < N; j ++)A[i][j] = i * N + j;

}void print_matrix(){int i,j;for(i = 0; i < N; i ++) {

for(j = 0; j < N; j ++)printf("%d ", A[i][j]);

printf("\n");}

}

Initialize source matrix

Page 694: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose - MPI

22

main(int argc, char* argv[])
{
  int r, i;
  MPI_Status st;
  MPI_Datatype typ;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &r);

  if(r == 0) {
    fill_matrix();
    printf("\n Source:\n");
    print_matrix();
    MPI_Type_contiguous(N * N, MPI_INT, &typ);
    MPI_Type_commit(&typ);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Send(&(A[0][0]), 1, typ, 1, 0, MPI_COMM_WORLD);
  }

Creating custom MPI datatypeto store local workloads

Page 695: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose - MPI

23

    else if (r == 1) {
        /* MPI_Type_vector(count=N, blocklength=1, stride=N): N single ints,
           each separated by a stride of N ints, i.e. one column of the matrix.
           Combining it with MPI_Type_hvector allows an on-the-fly transpose
           of the matrix during the receive. */
        MPI_Type_vector(N, 1, N, MPI_INT, &typ);
        MPI_Type_hvector(N, 1, sizeof(int), typ, &typ);
        MPI_Type_commit(&typ);

        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Recv(&(A[0][0]), 1, typ, 0, 0, MPI_COMM_WORLD, &st);
        printf("\n Transposed:\n");
        print_matrix();
    }

    MPI_Finalize();
    return 0;
}

Page 696: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose – MPI (DEMO)

24

[LSU760000@n01 matrix_transpose]$ mpiexec -np 2 ./mpi_mtrans

Source:
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15

Transposed:
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15

Page 697: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition

• Matrix Transpose

• Gauss-Jordan Elimination

• LU Decomposition

• Summary Materials for Test

25

Page 698: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Linear Systems

  a11 x1 + a12 x2 + a13 x3 = b1
  a21 x1 + a22 x2 + a23 x3 = b2
  a31 x1 + a32 x2 + a33 x3 = b3

  [ a11  a12  a13 ] [ x1 ]   [ b1 ]
  [ a21  a22  a23 ] [ x2 ] = [ b2 ]
  [ a31  a32  a33 ] [ x3 ]   [ b3 ]

Solve Ax = b, where A is an n x n matrix and b is an n x 1 column vector

www.cs.princeton.edu/courses/archive/fall07/cos323/

26

Page 699: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Gauss-Jordan Elimination

• Fundamental operations:

1. Replace one equation with linear combination

of other equations

2. Interchange two equations

3. Re-label two variables

• Combine to reduce to trivial system

• The simplest variant only uses #1 operations, but one gets better stability by adding

  – #2, or

  – #2 and #3

www.cs.princeton.edu/courses/archive/fall07/cos323/

27

Page 700: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Gauss-Jordan Elimination

• Solve:

  2 x1 + 3 x2 = 7
  4 x1 + 5 x2 = 13

• Can be represented as the augmented matrix

  [ 2  3 |  7 ]
  [ 4  5 | 13 ]

• Goal: reduce the LHS to an identity matrix, leaving the solutions in the RHS:

  [ 1  0 | ? ]
  [ 0  1 | ? ]

www.cs.princeton.edu/courses/archive/fall07/cos323/

28

Page 701: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Gauss-Jordan Elimination

• Basic operation #1: replace any row by a linear combination of itself and any other row:

  replace row1 with 1/2 * row1 + 0 * row2

• Replace row2 with row2 – 4 * row1

• Negate row2

  [ 2  3 |  7 ]      [ 1  3/2 | 7/2 ]      [ 1  3/2 | 7/2 ]      [ 1  3/2 | 7/2 ]
  [ 4  5 | 13 ]  =>  [ 4   5  |  13 ]  =>  [ 0  -1  |  -1 ]  =>  [ 0   1  |   1 ]

www.cs.princeton.edu/courses/archive/fall07/cos323/

29

Row1 = (Row1)/2

Row2=Row2-(4*Row1)

Row2 = (-1)*Row2

Page 702: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Gauss-Jordan Elimination

• Replace row1 with row1 – 3/2 * row2

• Solution:

x1 = 2, x2 = 1

  [ 1  3/2 | 7/2 ]      [ 1  0 | 2 ]
  [ 0   1  |   1 ]  =>  [ 0  1 | 1 ]

www.cs.princeton.edu/courses/archive/fall07/cos323/

30

Row1 = Row1 – (3/2)* Row2
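The row operations above are easy to mechanize. As a rough illustration (not part of the original slides), here is a minimal serial C sketch of Gauss-Jordan elimination on an augmented n x (n+1) matrix; it assumes every pivot is nonzero (pivoting is the subject of the next slides), and the function and variable names are ours.

#include <stdio.h>

#define N 2

/* Gauss-Jordan elimination on an augmented N x (N+1) matrix.
   Assumes every pivot a[i][i] is nonzero (no pivoting). */
void gauss_jordan(double a[N][N + 1])
{
    for (int i = 0; i < N; i++) {
        /* Scale row i so that the pivot becomes 1 */
        double pivot = a[i][i];
        for (int k = i; k <= N; k++)
            a[i][k] /= pivot;
        /* Eliminate column i from every other row */
        for (int j = 0; j < N; j++) {
            if (j == i)
                continue;
            double factor = a[j][i];
            for (int k = i; k <= N; k++)
                a[j][k] -= factor * a[i][k];
        }
    }
}

int main(void)
{
    /* The example system above: 2x1 + 3x2 = 7, 4x1 + 5x2 = 13 */
    double a[N][N + 1] = { { 2, 3, 7 }, { 4, 5, 13 } };
    gauss_jordan(a);
    printf("x1 = %f, x2 = %f\n", a[0][N], a[1][N]);  /* expect 2 and 1 */
    return 0;
}

After the loop finishes, the last (augmented) column holds the solution, matching x1 = 2, x2 = 1 above.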

Page 703: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Pivoting

• Consider this system:

  [ 0  1 ] [ x1 ]   [ 2 ]
  [ 2  3 ] [ x2 ] = [ 8 ]

• Immediately run into a problem: the algorithm wants us to divide by zero!

• More subtle version (the pivot is tiny rather than exactly zero):

  [ 0.0001  1 ] [ x1 ]   [ 2 ]
  [   2     3 ] [ x2 ] = [ 8 ]

• The pivot, or pivot element, is the element of a matrix which is selected first by an algorithm to do computation

• The pivot entry is usually required to be at least distinct from zero, and often distant from it

• Select the largest element in the matrix and swap columns and rows to bring this element to the "right" position: full (complete) pivoting

www.cs.princeton.edu/courses/archive/fall07/cos323/

31

Page 704: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Pivoting

• Consider this system:

  [ 0  1 ] [ x1 ]   [ 1 ]
  [ 3  2 ] [ x2 ] = [ 8 ]

• Pivoting:
  – Swap rows 1 and 2:

  [ 3  2 ] [ x1 ]   [ 8 ]
  [ 0  1 ] [ x2 ] = [ 1 ]

  – And continue to solve as shown before:

  [ 1  2/3 | 8/3 ]      [ 1  0 | 2 ]
  [ 0   1  |  1  ]  =>  [ 0  1 | 1 ]

  Solution: x1 = 2, x2 = 1

www.cs.princeton.edu/courses/archive/fall07/cos323/

32

Page 705: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Pivoting:Example

• Division by small numbers leads to round-off error in computer arithmetic

• Consider the following system:
  0.0001 x1 + x2 = 1.000
  x1 + x2 = 2.000

• Exact solution: x1 = 1.0001 and x2 = 0.9999

• Say we round off after 3 digits after the decimal point

• Multiply the first equation by 10^4 and subtract it from the second equation:

  (1 - 1) x1 + (1 - 10^4) x2 = 2 - 10^4

• But, in finite precision with only 3 digits:

  – 1 - 10^4 = -0.9999E+4 ~ -0.999E+4
  – 2 - 10^4 = -0.9998E+4 ~ -0.999E+4

• Therefore, x2 = 1 and x1 = 0 (from the first equation)

• Very far from the real solution!

  In matrix form:

  [ 0.0001  1 ] [ x1 ]   [ 1 ]
  [   1     1 ] [ x2 ] = [ 2 ]

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

33

Page 706: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Partial Pivoting

• Partial pivoting doesn't look for the largest element in the matrix, but just for the largest element in the 'current' column

• Swap rows to bring the corresponding row to the 'right' position

• Partial pivoting is generally sufficient to adequately

reduce round-off error.

• Complete pivoting is usually not necessary to ensure

numerical stability

• Due to the additional computations it introduces, it may

not always be the most appropriate pivoting strategy

34

http://www.amath.washington.edu/~bloss/amath352_lectures/

Page 707: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Partial Pivoting

• One can just swap rows:

  x1 + x2 = 2.000
  0.0001 x1 + x2 = 1.000

• Multiply the first equation by 0.0001 and subtract it from the second equation, which gives:

  (1 - 0.0001) x2 = 1 - 0.0002
  0.9999 x2 = 0.9998 => x2 = 1 (after rounding to 3 digits), and then x1 = 1

• The final solution is closer to the real solution.

• Partial Pivoting

  – For numerical stability, one doesn't go in order, but picks the next row among rows i to n that has the largest element in column i (see the code sketch below)

  – This row is swapped with row i (along with the elements of the right-hand side) before the subtractions

    • the swap is not done in memory but rather one keeps an indirection array

• Total Pivoting

– Look for the greatest element ANYWHERE in the matrix

– Swap columns

– Swap rows

• Numerical stability is really a difficult field

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

35
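As a rough illustration (not part of the original slides), here is a minimal C sketch of the partial-pivoting row selection inside Gaussian elimination on an augmented n x (n+1) matrix. For simplicity the swap is done directly in memory rather than through an indirection array, and all names are ours.

#include <math.h>
#include <stdio.h>

#define N 2

/* Gaussian elimination with partial pivoting on an augmented N x (N+1) matrix. */
void eliminate_with_partial_pivoting(double a[N][N + 1])
{
    for (int i = 0; i < N; i++) {
        /* Partial pivoting: among rows i..N-1, find the row with the
           largest absolute value in column i ... */
        int pivot_row = i;
        for (int r = i + 1; r < N; r++)
            if (fabs(a[r][i]) > fabs(a[pivot_row][i]))
                pivot_row = r;
        /* ... and swap it (including the right-hand side) with row i. */
        for (int k = 0; k <= N; k++) {
            double tmp = a[i][k];
            a[i][k] = a[pivot_row][k];
            a[pivot_row][k] = tmp;
        }
        /* Eliminate column i below the diagonal. */
        for (int j = i + 1; j < N; j++) {
            double factor = a[j][i] / a[i][i];
            for (int k = i; k <= N; k++)
                a[j][k] -= factor * a[i][k];
        }
    }
}

int main(void)
{
    /* The example above: 0.0001 x1 + x2 = 1.000, x1 + x2 = 2.000 */
    double a[N][N + 1] = { { 0.0001, 1.0, 1.0 }, { 1.0, 1.0, 2.0 } };
    eliminate_with_partial_pivoting(a);
    for (int i = 0; i < N; i++)
        printf("%10.6f %10.6f | %10.6f\n", a[i][0], a[i][1], a[i][2]);
    return 0;
}

The largest entry of column 1 is the 1 in row 2, so the rows are swapped before elimination and the tiny 0.0001 pivot is never divided by.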

Page 708: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Partial Pivoting

36

http://www.amath.washington.edu/~bloss/amath352_lectures/

Page 709: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Special Cases

• Common special cases:

• Tri-diagonal systems:
  – Only the main diagonal & 1 band above, 1 below
  – Solve using: Gauss-Jordan

• Lower Triangular Systems (L)
  – Solve using: forward substitution

• Upper Triangular Systems (U)
  – Solve using: backward substitution

  Tri-diagonal system:

  [ a11  a12   0    0  ] [ x1 ]   [ b1 ]
  [ a21  a22  a23   0  ] [ x2 ] = [ b2 ]
  [  0   a32  a33  a34 ] [ x3 ]   [ b3 ]
  [  0    0   a43  a44 ] [ x4 ]   [ b4 ]

  Lower triangular system, solved by forward substitution:

  [ a11   0    0    0  ] [ x1 ]   [ b1 ]
  [ a21  a22   0    0  ] [ x2 ] = [ b2 ]
  [ a31  a32  a33   0  ] [ x3 ]   [ b3 ]
  [ a41  a42  a43  a44 ] [ x4 ]   [ b4 ]

  x1 = b1 / a11
  x2 = (b2 - a21 x1) / a22
  x3 = (b3 - a31 x1 - a32 x2) / a33
  ...

  Upper triangular system, solved by backward substitution:

  [ a11  a12  a13  a14  a15 ] [ x1 ]   [ b1 ]
  [  0   a22  a23  a24  a25 ] [ x2 ]   [ b2 ]
  [  0    0   a33  a34  a35 ] [ x3 ] = [ b3 ]
  [  0    0    0   a44  a45 ] [ x4 ]   [ b4 ]
  [  0    0    0    0   a55 ] [ x5 ]   [ b5 ]

  x5 = b5 / a55
  x4 = (b4 - a45 x5) / a44
  ...

www.cs.princeton.edu/courses/archive/fall07/cos323/

37

Page 710: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition

• Matrix Transpose

• Gauss-Jordan Elimination

• LU Decomposition

• Summary Materials for Test

38

Page 711: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Solving Linear Systems of Eq.

• Method for solving Linear Systems

– The need to solve linear systems arises in an estimated 75% of all scientific computing problems [Dahlquist 1974]

• Gaussian Elimination is perhaps the most well-known method

– based on the fact that the solution of a linear system is invariant under scaling and under row additions

• One can multiply a row of the matrix by a constant as long as one multiplies the corresponding element of the right-hand side by the same constant

• One can add a row of the matrix to another one as long as one adds the corresponding elements of the right-hand side

– Idea: scale and add equations so as to transform matrix A into an upper triangular matrix:

  (Figure: A x = b with A reduced to upper triangular form; equation n-i then has i unknowns)

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

39

Page 712: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Gaussian Elimination

  [ 1  1  1 ]       [ 0 ]
  [ 1 -2  2 ]  x =  [ 4 ]
  [ 1  2 -1 ]       [ 2 ]

Subtract row 1 from rows 2 and 3:

  [ 1  1  1 ]       [ 0 ]
  [ 0 -3  1 ]  x =  [ 4 ]
  [ 0  1 -2 ]       [ 2 ]

Multiply row 3 by 3 and add row 2:

  [ 1  1  1 ]       [ 0 ]
  [ 0 -3  1 ]  x =  [ 4 ]
  [ 0  0 -5 ]       [ 10 ]

Solving the equations in reverse order (backsolving):

  -5 x3 = 10           =>  x3 = -2
  -3 x2 + x3 = 4       =>  x2 = -2
  x1 + x2 + x3 = 0     =>  x1 = 4

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

40

Page 713: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Gaussian Elimination

• The algorithm goes through the matrix from the top-left

corner to the bottom-right corner

• The ith step eliminates non-zero sub-diagonal elements

in column i, subtracting the ith row scaled by aji/aii from

row j, for j=i+1,..,n.

  (Figure: at step i, the rows above pivot row i hold values already computed;
   the entries of column i below the pivot row are to be zeroed;
   the remaining lower-right block holds values yet to be updated.)

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

41

Page 714: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Sequential Gaussian Elimination

Simple sequential algorithm

// for each column i
// zero it out below the diagonal by adding
// multiples of row i to later rows
for i = 1 to n-1
    // for each row j below row i
    for j = i+1 to n
        // add a multiple of row i to row j
        for k = i to n
            A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

• Several “tricks” that do not change the spirit of the algorithm but

make implementation easier and/or more efficient

– Right-hand side is typically kept in column n+1 of the matrix and one speaks of an augmented matrix

– Compute the A(j,i)/A(i,i) term outside of the innermost loop

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

42

Page 715: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Parallel Gaussian Elimination?

• Assume that we have one processor per matrix element

  – Reduction: to find the max aji

  – Broadcast: the max aji is needed to compute the scaling factor

  – Compute: independent computation of the scaling factor

  – Broadcasts: every update needs the scaling factor and the element from the pivot row

  – Compute: independent computations

  (A simple shared-memory sketch of the independent update step follows this slide.)

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

43
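As a rough shared-memory approximation of the scheme above (not part of the original slides, and without the pivoting reduction/broadcast steps), note that the row updates at each step are independent and can simply be parallelized with OpenMP; all names here are ours:

#include <stdio.h>
#include <omp.h>

#define N 3

/* Gaussian elimination with the independent row updates done in parallel.
   No pivoting: assumes the pivot a[i][i] is never zero. */
void parallel_ge(double a[N][N + 1])
{
    for (int i = 0; i < N - 1; i++) {
        #pragma omp parallel for
        for (int j = i + 1; j < N; j++) {      /* each row update is independent */
            double factor = a[j][i] / a[i][i];
            for (int k = i; k <= N; k++)
                a[j][k] -= factor * a[i][k];
        }
    }
}

int main(void)
{
    /* The example from the Gaussian Elimination slide: solution x = (4, -2, -2) */
    double a[N][N + 1] = { { 1, 1, 1, 0 }, { 1, -2, 2, 4 }, { 1, 2, -1, 2 } };
    double x[N];

    parallel_ge(a);

    for (int i = N - 1; i >= 0; i--) {         /* backsolving */
        x[i] = a[i][N];
        for (int k = i + 1; k < N; k++)
            x[i] -= a[i][k] * x[k];
        x[i] /= a[i][i];
    }
    printf("x = %f %f %f\n", x[0], x[1], x[2]);
    return 0;
}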

Page 716: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

LU Factorization

• Gaussian Elimination is simple but

– What if we have to solve many Ax = b systems for different values of b?
  • This happens a LOT in real applications

• Another method is the “LU Factorization” (LU Decomposition)

• Ax = b

• Say we could rewrite A = L U, where L is a lower triangular matrix and U is an upper triangular matrix: O(n^3)

• Then Ax = b is written L U x = b

• Solve L y = b: O(n^2)

• Solve U x = y: O(n^2)

  (L y = b: equation i has i unknowns; U x = y: equation n-i has i unknowns; triangular system solves are easy)

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

44
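As a rough illustration (not part of the original slides), here is a minimal C sketch of the two O(n^2) triangular solves used once L and U are known; all names are ours, and the L and U values in main() are taken from the worked example on the next slide:

#include <stdio.h>

#define N 3

/* Solve A x = b given A = L U:
   first L y = b by forward substitution, then U x = y by backward substitution. */
void lu_solve(const double L[N][N], const double U[N][N], const double b[N], double x[N])
{
    double y[N];
    for (int i = 0; i < N; i++) {          /* forward substitution: L y = b */
        y[i] = b[i];
        for (int j = 0; j < i; j++)
            y[i] -= L[i][j] * y[j];
        y[i] /= L[i][i];
    }
    for (int i = N - 1; i >= 0; i--) {     /* backward substitution: U x = y */
        x[i] = y[i];
        for (int j = i + 1; j < N; j++)
            x[i] -= U[i][j] * x[j];
        x[i] /= U[i][i];
    }
}

int main(void)
{
    /* L and U of A = [1 2 -1; 4 3 1; 2 2 3] (see the next slide) */
    double L[N][N] = { { 1, 0, 0 }, { 4, 1, 0 }, { 2, 0.4, 1 } };
    double U[N][N] = { { 1, 2, -1 }, { 0, -5, 5 }, { 0, 0, 3 } };
    double b[N] = { 2, 8, 7 };             /* = A * (1, 1, 1) */
    double x[N];

    lu_solve(L, U, b, x);
    printf("x = %f %f %f\n", x[0], x[1], x[2]);   /* expect 1 1 1 */
    return 0;
}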

Page 717: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

LU Factorization: Principle

• It works just like the Gaussian Elimination, but instead of zeroing out elements, one “saves” scaling coefficients.

• Magically, A = L x U !

• Should be done with pivoting as well

  Start:
  [ 1  2 -1 ]
  [ 4  3  1 ]
  [ 2  2  3 ]

  Gaussian elimination (Row2 = Row2 - 4*Row1):
  [ 1  2 -1 ]
  [ 0 -5  5 ]
  [ 2  2  3 ]

  Save the scaling factor (4) in the zeroed position:
  [ 1  2 -1 ]
  [ 4 -5  5 ]
  [ 2  2  3 ]

  Gaussian elimination + save the scaling factor (Row3 = Row3 - 2*Row1):
  [ 1  2 -1 ]
  [ 4 -5  5 ]
  [ 2 -2  5 ]

  Gaussian elimination + save the scaling factor (Row3 = Row3 - (2/5)*Row2):
  [ 1   2  -1 ]
  [ 4  -5   5 ]
  [ 2  2/5  3 ]

  L = [ 1   0   0 ]        U = [ 1  2 -1 ]
      [ 4   1   0 ]            [ 0 -5  5 ]
      [ 2  2/5  1 ]            [ 0  0  3 ]

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

45
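As a rough illustration (not part of the original slides), a minimal C sketch of this in-place factorization; the names are ours, and it uses the "positive factor" convention of the worked example above (the LU-sequential pseudo-code on the next slide stores the negated factor instead, but the idea is the same):

#include <stdio.h>

#define N 3

/* In-place LU factorization without pivoting:
   afterwards the strictly lower part of a[][] holds the saved scaling factors
   (L, with an implicit unit diagonal) and the rest holds U. */
void lu_in_place(double a[N][N])
{
    for (int k = 0; k < N - 1; k++) {
        for (int i = k + 1; i < N; i++) {
            a[i][k] /= a[k][k];                  /* save the scaling factor */
            for (int j = k + 1; j < N; j++)
                a[i][j] -= a[i][k] * a[k][j];    /* Gaussian elimination step */
        }
    }
}

int main(void)
{
    /* The worked example above: expect factors 4, 2, 2/5 below the diagonal
       and U = [1 2 -1; 0 -5 5; 0 0 3] on and above it. */
    double a[N][N] = { { 1, 2, -1 }, { 4, 3, 1 }, { 2, 2, 3 } };
    lu_in_place(a);
    for (int i = 0; i < N; i++)
        printf("%8.3f %8.3f %8.3f\n", a[i][0], a[i][1], a[i][2]);
    return 0;
}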

Page 718: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

LU Factorization

LU-sequential(A, n) {
    for k = 0 to n-2 {
        // preparing column k
        for i = k+1 to n-1
            aik ← -aik / akk        // stores the scaling factors
        for j = k+1 to n-1
            // Task Tkj: update of column j
            for i = k+1 to n-1
                aij ← aij + aik * akj
    }
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

• We’re going to look at the simplest possible version

– No pivoting: just creates a bunch of indirections that are easy but make

the code look complicated without changing the overall principle

46

Page 719: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

LU Factorization

• We’re going to look at the simplest possible version

– No pivoting: just creates a bunch of indirections that are easy but make

the code look complicated without changing the overall principle

LU-sequential(A, n) {
    for k = 0 to n-2 {
        // preparing column k
        for i = k+1 to n-1
            aik ← -aik / akk
        for j = k+1 to n-1
            // Task Tkj: update of column j
            for i = k+1 to n-1
                aij ← aij + aik * akj
    }
}

  (Figure: at step k, the prepared column k is combined with row k to update every entry (i, j) with i, j > k.)

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

47

Page 720: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Parallel LU on a ring

• Since the algorithm operates by columns from left to right, we should

distribute columns to processors

• Principle of the algorithm

– At each step, the processor that owns column k does the “prepare” task

and then broadcasts the bottom part of column k to all others

• Annoying if the matrix is stored in row-major fashion

• Remember that one is free to store the matrix in any way one wants, as long

  as it is coherent and the right output is generated

– After the broadcast, the other processors can then update their data.

• Assume there is a function alloc(k) that returns the rank of the

processor that owns column k

– Basically so that we don’t clutter our program with too many global-to-

local index translations

• In fact, we will first write everything in terms of global indices, so as to

  avoid all the annoying index arithmetic

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

48

Page 721: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

LU-broadcast algorithm

LU-broadcast(A, n) {
    q ← MY_NUM()
    p ← NUM_PROCS()
    for k = 0 to n-2 {
        if (alloc(k) == q)
            // preparing column k
            for i = k+1 to n-1
                buffer[i-k-1] ← aik ← -aik / akk
        broadcast(alloc(k), buffer, n-k-1)
        for j = k+1 to n-1
            if (alloc(j) == q)
                // update of column j
                for i = k+1 to n-1
                    aij ← aij + buffer[i-k-1] * akj
    }
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

49
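As a rough illustration (not part of the original slides), the pseudo-code above maps almost directly onto MPI when every rank keeps the full array, uses global indices, and only updates the columns it owns; alloc_col() (a cyclic mapping here), the test matrix, and all other names are ours, and the broadcast becomes an MPI_Bcast:

#include <stdio.h>
#include <mpi.h>

#define N 8

/* Column k is owned by process alloc_col(k, p); here a simple cyclic mapping. */
static int alloc_col(int k, int p) { return k % p; }

int main(int argc, char *argv[])
{
    int q, p;
    double a[N][N], buffer[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &q);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Every rank builds the same (diagonally dominant) test matrix, but from
       here on it only updates the columns it owns. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (i == j) ? N : 1.0;

    for (int k = 0; k < N - 1; k++) {
        if (alloc_col(k, p) == q)                 /* preparing column k */
            for (int i = k + 1; i < N; i++)
                buffer[i - k - 1] = a[i][k] = -a[i][k] / a[k][k];

        MPI_Bcast(buffer, N - k - 1, MPI_DOUBLE, alloc_col(k, p), MPI_COMM_WORLD);

        for (int j = k + 1; j < N; j++)           /* update of owned columns */
            if (alloc_col(j, p) == q)
                for (int i = k + 1; i < N; i++)
                    a[i][j] += buffer[i - k - 1] * a[k][j];
    }

    if (q == 0)
        printf("done: each rank holds the factored entries of its own columns\n");

    MPI_Finalize();
    return 0;
}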

Page 722: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Dealing with local indices

• Assume that p divides n

• Each processor needs to store r=n/p columns and its

local indices go from 0 to r-1

• After step k, only columns with indices greater than k will

be used

• Simple idea: use a local index, l, that everyone initializes

to 0

• At step k, processor alloc(k) increases its local index so

that next time it will point to its next local column

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

50
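For concreteness (not part of the original slides), the closed-form index translations corresponding to this bookkeeping are easy to state; a tiny C sketch, assuming p divides n as above:

#include <stdio.h>

int main(void)
{
    /* Global-to-local column index translation, with r = n/p columns per process.
       Block distribution:  global column k lives on process k / r, at local index k % r.
       Cyclic distribution: global column k lives on process k % p, at local index k / p. */
    int n = 8, p = 4, r = n / p;

    for (int k = 0; k < n; k++)
        printf("col %d: block -> (proc %d, local %d), cyclic -> (proc %d, local %d)\n",
               k, k / r, k % r, k % p, k / p);
    return 0;
}

The running local index l in the pseudo-code achieves the same effect incrementally, without recomputing these formulas at every step.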

Page 723: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

LU-broadcast algorithm

...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0

for k = 0 to n-2 {
    if (alloc(k) == q)
        for i = k+1 to n-1
            buffer[i-k-1] ← a[i,k] ← -a[i,l] / a[k,l]
        l ← l+1
    broadcast(alloc(k), buffer, n-k-1)
    for j = l to r-1
        for i = k+1 to n-1
            a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

51

Page 724: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Bad load balancing

  (Figure: columns distributed in contiguous blocks to P1, P2, P3, P4; the columns
   of P1 and P2 are already done, P3 is working on it, and the rest must wait.)

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

52

Page 725: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Good Load Balancing?

  (Figure: cyclic distribution of the columns; the "already done" and "working on it"
   columns are interleaved, so every processor stays busy.)

  Cyclic distribution

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

53

Page 726: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Load-balanced program

...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0

for k = 0 to n-2 {
    if (k mod p == q)
        for i = k+1 to n-1
            buffer[i-k-1] ← a[i,k] ← -a[i,l] / a[k,l]
        l ← l+1
    broadcast(alloc(k), buffer, n-k-1)
    for j = l to r-1
        for i = k+1 to n-1
            a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

54

Page 727: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Performance Analysis

• How long does this code take to run?
  – This is not an easy question because there are many tasks and many communications

• A little bit of analysis shows that the execution time is the sum of three terms:
  – n-1 communications: n L + (n^2/2) b + O(1)
  – n-1 column preparations: (n^2/2) w' + O(1)
  – column updates: (n^3/3p) w + O(n^2)

• Therefore, the execution time is O(n^3/p)
  – Note that the sequential time is O(n^3)

• Therefore, we have perfect asymptotic efficiency!
  – This is good, but isn't always the best in practice

• How can we improve this algorithm?

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

55
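Consolidating the three terms listed above (writing the parallel time as T_p(n), our notation):

\[
T_p(n) \;\approx\; \underbrace{nL + \tfrac{n^2}{2}\,b}_{\text{communications}}
\;+\; \underbrace{\tfrac{n^2}{2}\,w'}_{\text{column preparations}}
\;+\; \underbrace{\tfrac{n^3}{3p}\,w}_{\text{column updates}}
\;=\; O\!\left(\tfrac{n^3}{p}\right),
\qquad T_{\text{seq}}(n) = O(n^3).
\]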

Page 728: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Pipelining on the Ring

• So far in the algorithm we've used a simple broadcast

• Nothing was specific to being on a ring of processors, and it's portable
  – in fact, you could just write raw MPI that looks like our pseudo-code and have a very limited (inefficient for small n) LU factorization that works only for some numbers of processors

• But it's not efficient
  – The n-1 communication steps are not overlapped with computations
  – Therefore Amdahl's law, etc.

• It turns out that on a ring, with a cyclic distribution of the columns, one can interleave pieces of the broadcast with the computation
  – It almost looks like inserting the source code from the broadcast code we saw at the very beginning throughout the LU code

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

56

Page 729: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Previous program

...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0

for k = 0 to n-2 {
    if (k mod p == q)
        for i = k+1 to n-1
            buffer[i-k-1] ← a[i,k] ← -a[i,l] / a[k,l]
        l ← l+1
    broadcast(alloc(k), buffer, n-k-1)
    for j = l to r-1
        for i = k+1 to n-1
            a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

57

Page 730: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

LU-pipeline algorithm

double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0

for k = 0 to n-2 {
    if (k mod p == q)
        for i = k+1 to n-1
            buffer[i-k-1] ← a[i,k] ← -a[i,l] / a[k,l]
        l ← l+1
        send(buffer, n-k-1)
    else
        recv(buffer, n-k-1)
        if (q ≠ (k-1) mod p) send(buffer, n-k-1)
    for j = l to r-1
        for i = k+1 to n-1
            a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

58

Page 731: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition

• Matrix Transpose

• Gauss-Jordan Elimination

• LU Decomposition

• Summary Materials for Test

59

Page 732: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

Summary : Material for the Test

• Matrix Transpose: Slides 17-23

• Gauss-Jordan: Slides 26-30

• Pivoting: Slides 31-36

• Special Cases (forward & backward substitution): Slide 37

• LU Decomposition: Slides 44-58

60

Page 733: Paralel Computing

CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011

61