Parallel Computing
Teacher: Nurbek Saparkhojayev
Lecture #1: Introduction to Parallel Computing
Lecture #1 Outline
Background
Why use parallel computing?
Who and What?
Concepts and Terminology
Parallel Computer Memory Architectures
Background
Traditionally, software has been written for serial computation:
* To be run on a single computer having a single Central Processing Unit (CPU);
* A problem is broken into a discrete series of instructions.
* Instructions are executed one after another.
* Only one instruction may execute at any moment in time.
Cont.
In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
* To be run using multiple CPUs
* A problem is broken into discrete parts that can be solved concurrently
* Each part is further broken down to a series of instructions
* Instructions from each part execute simultaneously on different CPUs
Parallel Computing
The compute resources can include:
* A single computer with multiple processors;
* An arbitrary number of computers connected by a network;
* A combination of both.
The computational problem usually demonstrates characteristics such as the ability to be:
* Broken apart into discrete pieces of work that can be solved simultaneously;
* Executed as multiple program instructions at any moment in time;
* Solved in less time with multiple compute resources than with a single compute resource.
The Universe is Parallel:
Parallel computing is an evolution of serial computing that attempts to emulate what has always been the state of affairs in the natural world: many complex, interrelated events happening at the same time, yet within a sequence. For example:
* Galaxy formation
* Planetary movement
* Weather and ocean patterns
* Tectonic plate drift
* Rush hour traffic
* Automobile assembly line
* Building a space shuttle
* Ordering a hamburger at the drive through.
The Real World is Massively Parallel
Uses for Parallel Computing:
Historically, parallel computing has been considered to be "the high end of computing", and has been used to model difficult scientific and engineering problems found in the real world. Some examples:
* Atmosphere, Earth, Environment
* Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
* Bioscience, Biotechnology, Genetics
* Chemistry, Molecular Sciences
* Geology, Seismology
* Mechanical Engineering - from prosthetics to spacecraft
* Electrical Engineering, Circuit Design, Microelectronics
* Computer Science, Mathematics
Different applications
Today, commercial applications provide an equal or greater driving force in the development of faster computers. These applications require the processing of large amounts of data in sophisticated ways. For example:
* Databases, data mining
* Oil exploration
* Web search engines, web based business services
* Medical imaging and diagnosis
* Pharmaceutical design
* Management of national and multi-national corporations
* Financial and economic modeling
* Advanced graphics and virtual reality, particularly in the entertainment industry
* Networked video and multi-media technologies
* Collaborative work environments
Why use Parallel Computing?
Main Reasons:
a. Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings. Parallel clusters can be built from cheap, commodity components.
Cont.
b. Solve larger problems: Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer, especially given limited computer memory. For example:
* "Grand Challenge" (en.wikipedia.org/wiki/Grand_Challenge) problems requiring PetaFLOPS and PetaBytes of computing resources.
* Web search engines/databases processing millions of transactions per second
Cont.
c. Provide concurrency: A single compute resource can only do one thing at a time. Multiple computing resources can be doing many things simultaneously. For example, the Access Grid (www.accessgrid.org) provides a global collaboration network where people from around the world can meet and conduct work "virtually".
Cont.
d. Use of non-local resources: Using compute resources on a wide area network, or even the Internet, when local compute resources are scarce. For example:
* SETI@home (setiathome.berkeley.edu) uses over 330,000 computers for a compute power of over 528 TeraFLOPS (as of August 04, 2008)
* Folding@home (folding.stanford.edu) uses over 340,000 computers for a compute power of 4.2 PetaFLOPS (as of November 4, 2008)
Cont.
e. Limits to serial computing: Both physical and practical reasons pose significant constraints to simply building ever faster serial computers:
* Transmission speeds - the speed of a serial computer is directly dependent upon how fast data can move through hardware. Absolute limits are the speed of light (30 cm/nanosecond) and the transmission limit of copper wire (9 cm/nanosecond). Increasing speeds necessitate increasing proximity of processing elements.
* Limits to miniaturization - processor technology is allowing an increasing number of transistors to be placed on a chip. However, even with molecular or atomic-level components, a limit will be reached on how small components can be.
* Economic limitations - it is increasingly expensive to make a single processor faster. Using a larger number of moderately fast commodity processors to achieve the same (or better) performance is less expensive.
The bottom line: Current computer architectures are increasingly relying upon hardware-level parallelism to improve performance:
* Multiple execution units
* Pipelined instructions
* Multi-core
Who and What?
Top500.org provides statistics on parallel computing users - the charts below are just a sample. Some things to note:
* Sectors may overlap - for example, research may also be classified research, and respondents have to choose between the two.
* "Not Specified" is by far the largest application - it probably means multiple applications.
Who's doing Parallel Computing?
Future
The Future:
* During the past 20 years, the trends indicated by ever faster networks, distributed systems, and multi-processor computer architectures (even at the desktop level) clearly show that parallelism is the future of computing.
Concepts and Terminology
von Neumann Architecture
* Named after the Hungarian mathematician John von Neumann, who first authored the general requirements for an electronic computer in his 1945 papers.
* Since then, virtually all computers have followed this basic design, which differed from earlier computers programmed through "hard wiring".
Four main components:
1. Memory
2. Control Unit
3. Arithmetic Logic Unit
4. Input/Output
* Read/write, random access memory is used to store both program instructions and data
  o Program instructions are coded data which tell the computer to do something
  o Data is simply information to be used by the program
* The control unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task.
* The arithmetic logic unit performs basic arithmetic operations.
* Input/Output is the interface to the human operator.
Von Neumann architecture
Flynn's Classical Taxonomy
There are different ways to classify parallel computers. One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy. Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction and Data. Each of these dimensions can have only one of two possible states: Single or Multiple. There are 4 possible classifications according to Flynn:
* SISD
* SIMD
* MISD
* MIMD
Flynn's Classical Taxonomy - SISD
Single Instruction, Single Data (SISD):
* A serial (non-parallel) computer
* Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
* Single data: only one data stream is being used as input during any one clock cycle
* Deterministic execution
* This is the oldest and, even today, the most common type of computer
* Examples: older generation mainframes, minicomputers and workstations; most modern day PCs.
SISD
SIMD
Single Instruction, Multiple Data (SIMD):
* A type of parallel computer
* Single instruction: all processing units execute the same instruction at any given clock cycle
* Multiple data: each processing unit can operate on a different data element
* Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing.
* Synchronous (lockstep) and deterministic execution
* Two varieties: Processor Arrays and Vector Pipelines
* Examples:
  o Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
  o Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10
* Most modern computers, particularly those with graphics processing units (GPUs), employ SIMD instructions and execution units.
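For illustration only (this code is not from the original slides), here is a minimal C sketch of the kind of regular, element-wise loop that SIMD hardware and vectorizing compilers target; the array contents are invented:

#include <stdio.h>

/* A regular, element-wise loop of the kind SIMD hardware is built for: the
 * same instruction is applied to many data elements. A vectorizing compiler
 * can map this loop onto SIMD instructions. Values are invented. */
int main(void)
{
    float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};

    for (int i = 0; i < 8; i++)
        a[i] = a[i] + 2.0f * b[i];   /* same operation, different data elements */

    for (int i = 0; i < 8; i++)
        printf("%g ", a[i]);
    printf("\n");
    return 0;
}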
SIMD
Multiple Instruction, Single Data (MISD):
* A single data stream is fed into multiple processing units.
* Each processing unit operates on the data independently via independent instruction streams.
* Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).
* Some conceivable uses might be:
  o multiple frequency filters operating on a single signal stream
  o multiple cryptography algorithms attempting to crack a single coded message.
MISD
Multiple Instruction, Multiple Data (MIMD)
* Currently the most common type of parallel computer. Most modern computers fall into this category.
* Multiple Instruction: every processor may be executing a different instruction stream
* Multiple Data: every processor may be working with a different data stream
* Execution can be synchronous or asynchronous, deterministic or non-deterministic
* Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs.
* Note: many MIMD architectures also include SIMD execution sub-components.
MIMD
Some General Parallel Terminology
Task - A logically discrete section of computational work. A task is typically a program or program-like set of instructions that is executed by a processor.
Parallel Task - A task that can be executed by multiple processors safely (yields correct results).
Serial Execution - Execution of a program sequentially, one statement at a time. In the simplest sense, this is what happens on a one-processor machine. However, virtually all parallel programs have sections that must be executed serially.
Parallel Execution - Execution of a program by more than one task, with each task being able to execute the same or a different statement at the same moment in time.
Pipelining - Breaking a task into steps performed by different processor units, with inputs streaming through, much like an assembly line; a type of parallel computing.
Shared Memory - From a strictly hardware point of view, describes a computer architecture where all processors have direct (usually bus-based) access to common physical memory. In a programming sense, it describes a model where parallel tasks all have the same "picture" of memory and can directly address and access the same logical memory locations regardless of where the physical memory actually exists.
Terminology
Symmetric Multi-Processor (SMP) - Hardware architecture where multiple processors share a single address space and access to all resources; shared memory computing.
Distributed Memory - In hardware, refers to network-based memory access for physical memory that is not common. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing.
Communications - Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network; however, the actual event of data exchange is commonly referred to as communications regardless of the method employed.
Synchronization - The coordination of parallel tasks in real time, very often associated with communications. Often implemented by establishing a synchronization point within an application where a task may not proceed further until another task(s) reaches the same or logically equivalent point. Synchronization usually involves waiting by at least one task, and can therefore cause a parallel application's wall clock execution time to increase.
Terminology
Granularity - In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.
* Coarse: relatively large amounts of computational work are done between communication events
* Fine: relatively small amounts of computational work are done between communication events
Observed Speedup - Observed speedup of a code which has been parallelized, defined as:
    observed speedup = (wall-clock time of serial execution) / (wall-clock time of parallel execution)
One of the simplest and most widely used indicators of a parallel program's performance.
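As a worked illustration of the formula above (the timings are invented, not measured), a small C program computing observed speedup and the corresponding efficiency might look like this:

#include <stdio.h>

/* Observed speedup = serial wall-clock time / parallel wall-clock time.
 * The times below are hypothetical, chosen only to illustrate the formula. */
int main(void)
{
    double t_serial   = 120.0;              /* seconds, serial run          */
    double t_parallel = 20.0;               /* seconds, run on 8 processors */
    double speedup    = t_serial / t_parallel;
    double efficiency = speedup / 8.0;      /* speedup per processor        */

    printf("speedup = %.1fx, efficiency = %.0f%%\n", speedup, 100.0 * efficiency);
    return 0;
}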
Terminology
Parallel Overhead - The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel overhead can include factors such as:
* Task start-up time
* Synchronizations
* Data communications
* Software overhead imposed by parallel compilers, libraries, tools, operating system, etc.
* Task termination time
Massively Parallel - Refers to the hardware that comprises a given parallel system - having many processors. The meaning of "many" keeps increasing, but currently the largest parallel computers are comprised of processors numbering in the hundreds of thousands.
Embarrassingly Parallel - Solving many similar, but independent, tasks simultaneously; little to no need for coordination between the tasks.
Scalability - Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more processors. Factors that contribute to scalability include:
* Hardware - particularly memory-CPU bandwidths and network communications
* Application algorithm
* Parallel overhead
* Characteristics of your specific application and coding
Multi-core Processors - Multiple processors (cores) on a single chip.
Cluster Computing - Use of a combination of commodity units (processors, networks or SMPs) to build a parallel system.
Supercomputing / High Performance Computing - Use of the world's fastest, largest machines to solve large problems.
Parallel Computer Memory Architectures
a. Shared Memory - General Characteristics:
* Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as a global address space.
* Multiple processors can operate independently but share the same memory resources.
* Changes in a memory location effected by one processor are visible to all other processors.
* Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.
UMA
Uniform Memory Access (UMA):
* Most commonly represented today by Symmetric Multiprocessor (SMP) machines
* Identical processors
* Equal access and access times to memory
* Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.
UMA
Non-Uniform Memory Access (NUMA):
* Often made by physically linking two or more SMPs
* One SMP can directly access memory of another SMP
* Not all processors have equal access time to all memories
* Memory access across the link is slower
* If cache coherency is maintained, it may also be called CC-NUMA - Cache Coherent NUMA
NUMA
Advantages & Disadvantages
Advantages:
* Global address space provides a user-friendly programming perspective to memory
* Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs
Disadvantages:
* Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory-CPU path and, for cache coherent systems, geometrically increase traffic associated with cache/memory management.
* Programmer responsibility for synchronization constructs that ensure "correct" access of global memory.
* Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.
Distributed Memory
General Characteristics:
Like shared memory systems, distributed memory systems vary widely but share a common characteristic: distributed memory systems require a communication network to connect inter-processor memory.
Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of a global address space across all processors.
Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply.
When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.
The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.
Distributed Memory
Advantages:
Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately.
Each processor can rapidly access its own memory without interference and without the overhead incurred in trying to maintain cache coherency.
Cost effectiveness: can use commodity, off-the-shelf processors and networking.
Disadvantages:
The programmer is responsible for many of the details associated with data communication between processors.
It may be difficult to map existing data structures, based on global memory, to this memory organization.
Non-uniform memory access (NUMA) times.
Hybrid Distributed-Shared Memory
The largest and fastest computers in the world today employ both shared and distributed memory architectures.
The shared memory component is usually a cache coherent SMP machine. Processors on a given SMP can address that machine's memory as global.
The distributed memory component is the networking of multiple SMPs. SMPs know only about their own memory - not the memory on another SMP.
Therefore, network communications are required to move data from one SMP to another.
Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future.
Advantages and Disadvantages: whatever is common to both shared and distributed memory architectures.
The end of the first lecture!!
QUESTIONS? Comments? Requests?
Parallel Computing
Teacher: Nurbek Saparkhojayev
Lecture #2: Parallel Programming Models
Models
There are several parallel programming models in common use:
o Shared Memory
o Threads
o Message Passing
o Data Parallel
o Hybrid
Parallel programming models exist as an abstraction above hardware and memory architectures.
Although it might not seem apparent, these models are NOT specific to a particular type of machine or memory architecture. In fact, any of these models can (theoretically) be implemented on any underlying hardware. Two examples:
1st Model
1. Shared memory model on a distributed memory machine: Kendall Square Research (KSR) ALLCACHE approach.
Machine memory was physically distributed, but appeared to the user as a single shared memory (global address space). Generically, this approach is referred to as "virtual shared memory". Note: although KSR is no longer in
business, there is no reason to suggest that a similar implementation will not be made available by another vendor in the future.
2nd Model
2. Message passing model on a shared memory machine: MPI on SGI Origin.
The SGI Origin employed the CC-NUMA type of shared memory architecture, where every task has direct access to global memory. However,
the ability to send and receive messages with MPI, as is commonly done over a network of distributed memory machines, is not only implemented but is very
commonly used.
* Which model to use is often a combination of what is available and personal choice. There is no "best" model, although there certainly are better
implementations of some models over others.
* The following sections describe each of the models mentioned above, and also discuss some of their actual implementations.
Shared Memory Model (detailed)
In the shared-memory programming model, tasks share a common address space, which they read and write asynchronously.
Various mechanisms such as locks / semaphores may be used to control access to the shared memory.
An advantage of this model from the programmer's point of view is that the notion of data "ownership" is lacking, so there is no need to specify explicitly the
communication of data between tasks. Program development can often be simplified.
An important disadvantage in terms of performance is that it becomes more difficult to understand and manage data locality.
Keeping data local to the processor that works on it conserves memory accesses, cache refreshes and bus traffic that occurs when multiple processors
use the same data. Unfortunately, controlling data locality is hard to understand and beyond
the control of the average user.
Shared Memory Model (detailed)
Implementations:
On shared memory platforms, the native compilers translate user program variables into actual memory addresses, which are global.
No common distributed memory platform implementations currently exist. However, as mentioned previously in the Overview section, the KSR
ALLCACHE approach provided a shared memory view of data even though the physical memory of the machine was distributed.
Threads Model
In the threads model of parallel programming, a single process can have multiple, concurrent execution paths.
Perhaps the simplest analogy that can be used to describe threads is the concept of a single program that includes a number of subroutines:
Threads Model
Threads Model (Code)
The main program a.out is scheduled to run by the native operating system. a.out loads and acquires all of the necessary system and user resources to run.
a.out performs some serial work, and then creates a number of tasks (threads) that can be scheduled and run by the operating system concurrently.
Each thread has local data, but also, shares the entire resources of a.out. This saves the overhead associated with replicating a program's resources for each thread. Each thread also benefits from a global memory view because it
shares the memory space of a.out.
A thread's work may best be described as a subroutine within the main program. Any thread can execute any subroutine at the same time as other
threads.
Threads Model
Threads communicate with each other through global memory (updating address locations). This requires synchronization constructs to ensure that more
than one thread is not updating the same global address at any time.
Threads can come and go, but a.out remains present to provide the necessary shared resources until the application has completed.
Threads are commonly associated with shared memory architectures and operating systems.
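To make the a.out picture above concrete, here is a minimal Pthreads sketch (assumed for illustration, not taken from the lecture): the main program plays the role of a.out, creates threads that run a subroutine, and the threads share global memory. Compile with a command such as gcc -pthread.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

int shared_data[NUM_THREADS];        /* global memory visible to every thread */

/* The "subroutine within the main program" that each thread executes. */
void *worker(void *arg)
{
    long id = (long)arg;             /* each thread also has its own local data */
    shared_data[id] = (int)(id * id);
    return NULL;
}

int main(void)                       /* plays the role of a.out */
{
    pthread_t threads[NUM_THREADS];

    /* a.out does some serial work, then creates threads the OS schedules. */
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    /* a.out remains present until the threads have completed. */
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    for (int i = 0; i < NUM_THREADS; i++)
        printf("thread %d wrote %d\n", i, shared_data[i]);
    return 0;
}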
Threads Implementations:
From a programming perspective, threads implementations commonly comprise:
* A library of subroutines that are called from within parallel source code
* A set of compiler directives imbedded in either serial or parallel source code
In both cases, the programmer is responsible for determining all parallelism.
Threaded implementations are not new in computing. Historically, hardware vendors have implemented their own proprietary versions of threads. These implementations differed substantially from each other making it difficult for
programmers to develop portable threaded applications.
Threads Implementations:
Unrelated standardization efforts have resulted in two very different implementations of threads: POSIX Threads and OpenMP.
# POSIX Threads
* Library based; requires parallel coding
* Specified by the IEEE POSIX 1003.1c standard (1995).
* C language only
* Commonly referred to as Pthreads.
* Most hardware vendors now offer Pthreads in addition to their proprietary threads implementations.
* Very explicit parallelism; requires significant programmer attention to detail.
# OpenMP
* Compiler directive based; can use serial code
* Jointly defined and endorsed by a group of major computer hardware and software vendors. The OpenMP Fortran API was released October 28, 1997. The C/C++ API was released in late 1998.
* Portable / multi-platform, including Unix and Windows NT platforms
* Available in C/C++ and Fortran implementations
* Can be very easy and simple to use - provides for "incremental parallelism"
# Microsoft has its own implementation for threads, which is not related to the UNIX POSIX standard or OpenMP.
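As a hedged sketch of OpenMP's directive-based, "incremental parallelism" style (values invented; compile with, e.g., gcc -fopenmp): a single pragma added to otherwise serial code parallelizes the loop, and removing the pragma leaves valid serial code.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    static double a[1000000];
    double sum = 0.0;

    /* One compiler directive parallelizes this otherwise serial loop. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        a[i] = 0.5 * i;
        sum += a[i];
    }

    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}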
More Information:
POSIX Threads tutorial: computing.llnl.gov/tutorials/pthreads
OpenMP tutorial: computing.llnl.gov/tutorials/openMP
Terminology
Performance: A quantifiable measure of the rate of doing (computational) work.
There are multiple such measures of performance, delineated at the level of the basic operation:
* ops - operations per second
* ips - instructions per second
* flops - floating point operations per second
Or the rate at which a benchmark program executes:
* A benchmark is a carefully crafted and controlled code used to compare systems
* Linpack Rmax (Linpack flops)
* gups (billion updates per second)
* others
Two perspectives on performance:
* Peak performance - the maximum theoretical performance possible for a system
* Sustained performance - the observed performance for a particular workload and run; varies across workloads and possibly between runs
Scalability
The ability to deliver proportionally greater sustained performance through increased system resources.
* Strong Scaling - fixed size application problem; application size remains constant with increase in system size
* Weak Scaling - variable size application problem; application size scales proportionally with system size
* Capability computing - in its most pure form, strong scaling; marketing claims tend toward this class
* Capacity computing - throughput computing, including job-stream workloads; in its most simple form, weak scaling
Cooperative computing
Interacting and coordinating concurrent processes
Not a widely used term
Also: coordinated computing
The end of the first half of 2nd Lecture
Questions? Comments? Requests?
Parallel Computing
Teacher: Nurbek Saparkhojayev
Lecture #2: Parallel Programming Models (continued)
Message Passing Model
Message Passing Model
The message passing model demonstrates the following characteristics:
# A set of tasks that use their own local memory during computation. Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines.
# Tasks exchange data through communications by sending and receiving messages.
# Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.
Implementations:
* From a programming perspective, message passing implementations commonly comprise a library of subroutines that are imbedded in source code. The programmer is responsible for determining all parallelism.
* Historically, a variety of message passing libraries have been available since the 1980s. These implementations differed substantially from each other, making it difficult for programmers to develop portable applications.
* In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations.
* Part 1 of the Message Passing Interface (MPI) was released in 1994. Part 2 (MPI-2) was released in 1996. Both MPI specifications are available on the web at http://www-unix.mcs.anl.gov/mpi/.
* MPI is now the "de facto" industry standard for message passing, replacing virtually all other message passing implementations used for production work. Most, if not all, of the popular parallel computing platforms offer at least one implementation of MPI. A few offer a full implementation of MPI-2.
* For shared memory architectures, MPI implementations usually don't use a network for task communications. Instead, they use shared memory (memory copies) for performance reasons.
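A minimal MPI sketch of the cooperative send/matching-receive pattern described above (illustrative only; run with at least two tasks, e.g. mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* A send on one task must be matched by a receive on the other. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("task 1 received %d from task 0\n", value);
    }

    MPI_Finalize();
    return 0;
}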
More Info
MPI tutorial: computing.llnl.gov/tutorials/mpi
Data Parallel Model
The data parallel model demonstrates the following characteristics:
o Most of the parallel work focuses on performing operations on a data set. The data set is typically organized into a common structure, such as an array or cube.
o A set of tasks work collectively on the same data structure; however, each task works on a different partition of the same data structure.
o Tasks perform the same operation on their partition of work, for example, "add 4 to every array element".
* On shared memory architectures, all tasks may have access to the data structure through global memory. On distributed memory architectures, the data structure is split up and resides as "chunks" in the local memory of each task.
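A minimal C sketch (invented for illustration, standing in for a real data parallel language) of the "add 4 to every array element" example above, with the array divided into partitions so that each task applies the same operation to its own chunk:

#include <stdio.h>

#define N 16
#define NTASKS 4

int main(void)
{
    int data[N];
    for (int i = 0; i < N; i++) data[i] = i;

    /* Each "task" (here just a loop iteration standing in for a real worker)
     * applies the same operation, "add 4", to its own partition of one array. */
    int chunk = N / NTASKS;
    for (int task = 0; task < NTASKS; task++) {
        int lo = task * chunk;
        for (int i = lo; i < lo + chunk; i++)
            data[i] += 4;            /* same operation, different partition */
    }

    for (int i = 0; i < N; i++) printf("%d ", data[i]);
    printf("\n");
    return 0;
}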
Data Parallel Model
Implementations:
Programming with the data parallel model is usually accomplished by writing a program with data parallel constructs. The constructs can be calls to a data parallel subroutine library or compiler directives recognized by a data parallel compiler.
Fortran 90 and 95 (F90, F95): ISO/ANSI standard extensions to Fortran 77.
* Contains everything that is in Fortran 77
* New source code format; additions to character set
* Additions to program structure and commands
* Variable additions - methods and arguments
* Pointers and dynamic memory allocation added
* Array processing (arrays treated as objects) added
* Recursive and new intrinsic functions added
* Many other new features
Implementations are available for most common parallel platforms.
HPF
# High Performance Fortran (HPF): Extensions to Fortran 90 to support data parallel programming.
* Contains everything in Fortran 90
* Directives to tell the compiler how to distribute data added
* Assertions that can improve optimization of generated code added
* Data parallel constructs added (now part of Fortran 95)
HPF compilers were common in the 1990s, but are no longer commonly implemented.
# Compiler Directives: Allow the programmer to specify the distribution and alignment of data. Fortran implementations are available for most common parallel platforms.
# Distributed memory implementations of this model usually have the compiler convert the program into standard code with calls to a message passing library (usually MPI) to distribute the data to all the processes. All message passing is done invisibly to the programmer.
Other Models
Other parallel programming models besides those previously mentioned certainly exist, and will
continue to evolve along with the ever changing world of computer hardware and software. Only three of the more common ones are mentioned
here.
Hybrid
# In this model, any two or more parallel programming models are combined.
# Currently, a common example of a hybrid model is the combination of the message passing model (MPI) with either the threads model (POSIX threads) or the shared memory model (OpenMP). This hybrid model lends itself well to
the increasingly common hardware environment of networked SMP machines.
# Another common example of a hybrid model is combining data parallel with message passing. As mentioned in the data parallel model section
previously, data parallel implementations (F90, HPF) on distributed memory architectures actually use message passing to transmit data between tasks,
transparently to the programmer.
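A hedged skeleton of the common MPI-plus-OpenMP hybrid just described (illustrative values; assumes an MPI library and OpenMP support): message passing connects the processes on different SMP nodes, while threads share memory within each process.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Threads share memory within one process (one SMP node). */
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000; i++)
        local_sum += i * 0.001;

    /* Message passing combines the per-process results across nodes. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}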
Single Program Multiple Data (SPMD)
SPMD
SPMD is actually a "high level" programming model that can be built upon any combination of the previously mentioned parallel programming models.
# A single program is executed by all tasks simultaneously.
# At any moment in time, tasks can be executing the same or different instructions within the same program.
# SPMD programs usually have the necessary logic programmed into them to allow different tasks to branch or conditionally execute only those parts of the program they are designed to execute. That is, tasks do not necessarily have to execute the entire program - perhaps only a portion of it.
# All tasks may use different data.
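A minimal sketch (assumed for illustration, using MPI only as a convenient vehicle) of the SPMD pattern: one program is run by every task, and branching on the task's identity decides which part of the program each task executes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Same executable everywhere; conditional logic selects each task's part. */
    if (rank == 0)
        printf("task 0 of %d: doing the coordination part\n", size);
    else
        printf("task %d of %d: doing a slice of the computation\n", rank, size);

    MPI_Finalize();
    return 0;
}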
Multiple Program Multiple Data (MPMD)
MPMD
Like SPMD, MPMD is actually a "high level" programming model that can be built upon any combination of the previously mentioned parallel programming
models. MPMD applications typically have multiple executable object files (programs).
While the application is being run in parallel, each task can be executing the same or different program as other tasks.
# All tasks may use different data
Parallel Computing
Teacher: Nurbek Saparkhojayev
Lecture #3: Designing Parallel Programs
Automatic vs. Manual Parallelization
Designing and developing parallel programs has characteristically been a very manual process.
The programmer is typically responsible for both identifying and actually implementing parallelism.
Very often, manually developing parallel codes is a time consuming, complex, error-prone and
iterative process.
For a number of years now, various tools have been available to assist the programmer with
converting serial programs into parallel programs.
The most common type of tool used to automatically parallelize a serial program is a parallelizing
compiler or pre-processor.
A parallelizing compiler generally works in two different ways:
1. Fully Automatic
The compiler analyzes the source code and identifies opportunities for parallelism.
The analysis includes identifying inhibitors to parallelism and possibly a cost weighting on
whether or not the parallelism would actually improve performance.
Loops (do, for) are the most frequent target for automatic parallelization.
Automatic vs. Manual Parallelization
2. Programmer Directed
Using "compiler directives" or possibly compiler flags, the programmer explicitly tells the
compiler how to parallelize the code.
May be able to be used in conjunction with some degree of automatic parallelization also.
If you are beginning with an existing serial code and have time or budget constraints, then automatic
parallelization may be the answer.
However, there are several important caveats that apply to automatic parallelization:
* Wrong results may be produced
* Performance may actually degrade
* Much less flexible than manual parallelization
* Limited to a subset (mostly loops) of code
* May actually not parallelize code if the analysis suggests there are inhibitors or the code
is too complex
The remainder of this section applies to the manual method of developing parallel codes.
Understand the Problem and the Program
1.Understand the problem you are trying to solve.
2. Think about the option of parallelizing this problem. Can you parallelize this problem or not?
Example of Parallelizable Problem:
Calculate the potential energy for each of several thousand independent conformations of a
molecule. When done, find the minimum energy conformation.
This problem is able to be solved in parallel. Each of the molecular conformations is independently
determinable. The calculation of the minimum energy conformation is also a parallelizable problem.
# Example of a Non-parallelizable Problem:
Calculation of the Fibonacci series (1,1,2,3,5,8,13,21,...) by use of the formula:
F(n) = F(n-1) + F(n-2)
This is a non-parallelizable problem because the calculation of the Fibonacci sequence as shown
would entail dependent calculations rather than independent ones. The calculation of the F(n) value
uses those of both F(n-1) and F(n-2). These three terms cannot be calculated independently and
therefore, not in parallel.
Understand the Problem and the Program
3. Identify the program's hotspots:
Know where most of the real work is being done. The majority of scientific and technical
programs usually accomplish most of their work in a few places.
Profilers and performance analysis tools can help here
Focus on parallelizing the hotspots and ignore those sections of the program that account for little
CPU usage.
4. Identify bottlenecks in the program
Are there areas that are disproportionately slow, or cause parallelizable work to halt or be deferred?
For example, I/O is usually something that slows a program down.
May be possible to restructure the program or use a different algorithm to reduce or eliminate
unnecessary slow areas
5. Identify inhibitors to parallelism. One common class of inhibitor is data dependence, as
demonstrated by the Fibonacci sequence above.
6. Investigate other algorithms if possible. This may be the single most important consideration when
designing a parallel application.
Partitioning
One of the first steps in designing a parallel program is to break the problem into
discrete "chunks" of work that can be distributed to multiple tasks. This is known
as decomposition or partitioning.
There are two basic ways to partition computational work among parallel tasks:
domain decomposition and functional decomposition.
However, combining these two types of problem decomposition is common and
natural.
a. Domain Decomposition
In this type of partitioning, the data associated with a problem is decomposed. Each parallel task then works on a portion of the data.
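A minimal C sketch (illustrative only; the helper name block_range is made up) of block-style domain decomposition: given a global array and a number of tasks, each task computes which slice of the data it owns.

#include <stdio.h>

/* Block partitioning: task 'task' of 'ntasks' owns elements [start, end) of a
 * global array of n elements; any remainder is spread over the first tasks. */
void block_range(int n, int ntasks, int task, int *start, int *end)
{
    int base = n / ntasks;
    int rem  = n % ntasks;
    *start = task * base + (task < rem ? task : rem);
    *end   = *start + base + (task < rem ? 1 : 0);
}

int main(void)
{
    const int n = 10, ntasks = 4;
    for (int t = 0; t < ntasks; t++) {
        int lo, hi;
        block_range(n, ntasks, t, &lo, &hi);
        printf("task %d owns elements [%d, %d)\n", t, lo, hi);
    }
    return 0;
}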
a. Domain Decomposition
There are different ways to partition the data among tasks, for example block or cyclic distributions across one or more array dimensions.
b. Functional Decomposition
In this approach, the focus is on the computation that is to be performed rather than on the data
manipulated by the computation. The problem is decomposed according to the work that must be
done. Each task then performs a portion of the overall work.
b. Functional Decomposition
Functional decomposition lends itself well to problems that can be split into different tasks. For
example:
1. Ecosystem Modeling
Each program calculates the population of a given group, where each group's growth depends on
that of its neighbors. As time progresses, each process calculates its current state, then exchanges
information with the neighbor populations. All tasks then progress to calculate the state at the next
time step.
b. Functional Decomposition
2. Signal Processing:
An audio signal data set is passed through four distinct computational filters. Each filter is a separate
process. The first segment of data must pass through the first filter before progressing to the second.
When it does, the second segment of data passes through the first filter. By the time the fourth
segment of data is in the first filter, all four tasks are busy.
b. Functional Decomposition
3. Climate Modeling
Each model component can be thought of as a separate task. Data is exchanged between components during computation: the atmosphere model generates wind velocity data that are used by the ocean model, the ocean model generates sea surface temperature data that are used by the atmosphere model, and so on.
Communications
Who Needs Communications?
The need for communications between tasks depends upon your problem:
You DON'T need communications:
- Some types of problems can be decomposed and executed in parallel with virtually no need
for tasks to share data. For example, imagine an image processing operation where every pixel in a
black and white image needs to have its color reversed. The image data can easily be distributed to
multiple tasks that then act independently of each other to do their portion of the work.
- These types of problems are often called embarrassingly parallel because they are so
straight-forward. Very little inter-task communication is required.
You DO need communications
- Most parallel applications are not quite so simple, and do require tasks to share data with each other. For example, a 3-D heat diffusion problem requires a task to know the temperatures calculated by the tasks that have neighboring data. Changes to neighboring data have a direct effect on that task's data.
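To illustrate the "no communications needed" case above, here is a hedged OpenMP sketch (the 16-pixel "image" is invented) in which every pixel is inverted independently, so the tasks never exchange data:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    unsigned char pixels[16] = {0, 32, 64, 96, 128, 160, 192, 224,
                                255, 223, 191, 159, 127, 95, 63, 31};

    /* Embarrassingly parallel: each pixel is updated with no dependence
     * on any other pixel, so no inter-task communication is required. */
    #pragma omp parallel for
    for (int i = 0; i < 16; i++)
        pixels[i] = 255 - pixels[i];

    for (int i = 0; i < 16; i++)
        printf("%d ", pixels[i]);
    printf("\n");
    return 0;
}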
Factors to Consider:
There are a number of important factors to consider when designing your program's inter-task
communications:
Cost of communications
- Inter-task communication virtually always implies overhead.
- Machine cycles and resources that could be used for computation are instead
used to package and transmit data.
- Communications frequently require some type of synchronization between
tasks, which can result in tasks spending time "waiting" instead of doing work.
- Competing communication traffic can saturate the available network
bandwidth, further aggravating performance problems.
Latency vs. Bandwidth
Latency is the time it takes to send a minimal (0 byte) message from point A to point B.
Commonly expressed as microseconds.
Bandwidth is the amount of data that can be communicated per unit of time. Commonly expressed
as megabytes/sec or gigabytes/sec.
Sending many small messages can cause latency to dominate communication overheads. Often
it is more efficient to package small messages into a larger message, thus increasing the effective
communications bandwidth.
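A small worked calculation (the latency and bandwidth figures are invented, purely for illustration) showing why many small messages are dominated by latency and why aggregating them helps:

#include <stdio.h>

/* Hypothetical network: 1 microsecond latency per message, 1 GB/s bandwidth.
 * Compare sending 1000 messages of 1 KB with one aggregated ~1 MB message. */
int main(void)
{
    double latency   = 1.0e-6;               /* seconds per message    */
    double bandwidth = 1.0e9;                /* bytes per second       */
    double total     = 1000.0 * 1024.0;      /* total payload in bytes */

    double t_many = 1000.0 * (latency + 1024.0 / bandwidth);
    double t_one  = latency + total / bandwidth;

    printf("1000 small messages : %.3f ms\n", t_many * 1e3);  /* ~2.0 ms */
    printf("1 aggregated message: %.3f ms\n", t_one  * 1e3);  /* ~1.0 ms */
    return 0;
}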
Factors to consider
Visibility of communications
With the Message Passing Model, communications are explicit and generally quite visible and under
the control of the programmer.
With the Data Parallel Model, communications often occur transparently to the programmer,
particularly on distributed memory architectures. The programmer may not even be able to know exactly
how inter-task communications are being accomplished.
Synchronous vs. asynchronous communications
Synchronous communications require some type of "handshaking" between tasks that are sharing
data. This can be explicitly structured in code by the programmer, or it may happen at a lower level
unknown to the programmer.
Synchronous communications are often referred to as blocking communications since other work
must wait until the communications have completed.
Asynchronous communications allow tasks to transfer data independently from one another. For
example, task 1 can prepare and send a message to task 2, and then immediately begin doing other
work. When task 2 actually receives the data doesn't matter.
Asynchronous communications are often referred to as non-blocking communications since other
work can be done while the communications are taking place.
Interleaving computation with communication is the single greatest benefit for using asynchronous
communications.
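A minimal hedged sketch of asynchronous (non-blocking) message passing using MPI's MPI_Isend/MPI_Irecv (illustrative; run with two tasks): the transfer is started, unrelated computation can proceed, and a task waits only when it actually needs the data.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double buf[1000];
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < 1000; i++) buf[i] = i;
        MPI_Isend(buf, 1000, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    } else if (rank == 1) {
        MPI_Irecv(buf, 1000, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    }

    /* ... other work that does not touch buf can be done here ... */

    if (rank <= 1) {
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* communication completes here */
        if (rank == 1)
            printf("task 1 received %g ... %g\n", buf[0], buf[999]);
    }

    MPI_Finalize();
    return 0;
}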
Factors to consider
Scope of communications
Knowing which tasks must communicate with each other is critical during the design stage of a
parallel code. Both of the two scopings described below can be implemented synchronously or
asynchronously.
Point-to-point - involves two tasks with one task acting as the sender/producer of data, and the
other acting as the receiver/consumer.
Collective - involves data sharing between more than two tasks, which are often specified as being members of a common group, or collective. Some common variations (there are more) include broadcast, scatter, gather, and reduction.
Factors to consider
Efficiency of communications
Very often, the programmer will have a choice with regard to factors that can affect
communications performance. Only a few are mentioned here.
Which implementation for a given model should be used? Using the Message Passing Model as
an example, one MPI implementation may be faster on a given hardware platform than another.
What type of communication operations should be used? As mentioned previously, asynchronous
communication operations can improve overall program performance.
Network media - some platforms may offer more than one network for communications. Which
one is best?
Overhead and Complexity
Synchronization
Types of Synchronization:
1. Barrier- Usually implies that all tasks are involved
Each task performs its work until it reaches the barrier. It then stops, or "blocks".
When the last task reaches the barrier, all tasks are synchronized.
What happens from here varies. Often, a serial section of work must be done. In other cases,
the tasks are automatically released to continue their work.
2. Lock / semaphore - Can involve any number of tasks
Typically used to serialize (protect) access to global data or a section of code. Only one task
at a time may use (own) the lock / semaphore / flag.
The first task to acquire the lock "sets" it. This task can then safely (serially) access the
protected data or code.
Other tasks can attempt to acquire the lock but must wait until the task that owns the lock releases it. Locks can be blocking or non-blocking (see the sketch after this list).
3. Synchronous communication operations
Involves only those tasks executing a communication operation
When a task performs a communication operation, some form of coordination is required with
the other task(s) participating in the communication. For example, before a task can perform a send
operation, it must first receive an acknowledgment from the receiving task that it is OK to send.
Discussed previously in the Communications section
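As a hedged Pthreads sketch of two of the synchronization types above (assuming a POSIX barrier implementation is available; compile with, e.g., gcc -pthread): a lock serializes access to a shared counter, and a barrier holds every thread until all have arrived.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

pthread_mutex_t   lock = PTHREAD_MUTEX_INITIALIZER;
pthread_barrier_t barrier;
long counter = 0;                        /* shared (global) data */

void *worker(void *arg)
{
    (void)arg;

    pthread_mutex_lock(&lock);           /* only one task "owns" the lock  */
    counter++;                           /* protected (serialized) update  */
    pthread_mutex_unlock(&lock);

    pthread_barrier_wait(&barrier);      /* block until every task arrives */
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_THREADS];
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);

    printf("counter = %ld\n", counter);  /* always NUM_THREADS */
    pthread_barrier_destroy(&barrier);
    return 0;
}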
The end of the lecture
Questions? Comments?
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
PARALLEL COMPUTER ARCHITECTURE
Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
January 20, 2011
Topics
• Introduction
• Review Performance Factors (good & bad)
• What is Computer Architecture
• Parallel Structures & Performance Issues
• Performance Metrics
• Coarse-grained MIMD Processing – MPPs
• Very Fine-grained Vector Processing and PVPs
• SIMD array and SPMD
• Special Purpose Devices and Systolic Structures
• An Introduction to Shared Memory Multiprocessors
• Current generation multicore and heterogeneous architectures
• Summary – Material for the Test
Opening Remarks
• This lecture is an introduction to supercomputer architecture
  – Major parameters, classes, and system level
• Architecture exploits device technology to deliver its innate computation performance potential
  – Structures and system organization
  – Semantics of operation and memory (instruction set architecture, ISA)
• Between device technology and architecture is circuit design
  – Circuit design converts devices to logic gates and higher level logical structures (e.g. multiplexers, adders)
  – but this is outside the scope of this course.
• We will assume a basic logic abstraction with characterizing properties:
  – Functional behavior (the logical operation it performs)
  – Switching speed
  – Propagation delay or latency
  – Size and power
HPC System Stack
Science Problems: Environmental Modeling, Physics, Computational Chemistry, etc.
Application: Coastal Modeling, Black hole simulations, etc.
Algorithms: PDE, Gaussian Elimination, 12 Dwarves, etc.
Program Source Code
Programming Languages: Fortran, C, C++, UPC, Fortress, X10, etc.
Compilers: Intel C/C++/Fortran Compilers, PGI C/C++/Fortran, IBM XLC, XLC++, XLF, etc.
Runtime Systems: Java Runtime, MPI, etc.
Operating Systems: Linux, Unix, AIX, etc.
Systems Architecture: Vector, SIMD array, MPP, Commodity Cluster
Firmware: Motherboard chipset, BIOS, NIC drivers
Microarchitectures: Intel/AMD x86, SUN SPARC, IBM Power 5/6
Logic Design: RTL
Circuit Design: ASIC, FPGA, Custom VLSI
Device Technology: NMOS, CMOS, TTL, Optical
(Figure label spanning the stack: Model of Computation)
Performance Factors: Technology Speed
• Latencies
  – Logic latency time
  – Processor to memory access latency
  – Memory access time
  – Network latency
• Cycle Times
  – Logic switching speed
  – On-chip clock speed (clock cycle time)
  – Memory cycle time
• Throughput
  – On-chip data transfer rate
  – Instructions per cycle
  – Network data rate
• Granularity
  – Logic density
  – Memory density
  – Task size
  – Packet size
Machine Parameters affecting Performance
• Peak floating point performance
• Main memory capacity
• Bi-section bandwidth
• I/O bandwidth
• Secondary storage capacity
• Organization
  – Class of system
  – # nodes
  – # processors per node
  – Accelerators
  – Network topology
• Control strategy
  – MIMD
  – Vector, PVP
  – SIMD
  – SPMD
Performance Factors: Parallelism
• Fully independent processing elements operating concurrently on separate tasks
  – Coarse grained
  – Communicating Sequential Processes (CSP)
  – Single Program Multiple Data stream (SPMD)
• Instruction Level Parallelism (ILP)
  – Fine grained
  – Single instruction performs multiple operations
• Pipelining
  – Fine grained
  – Overlapping sequential operations in execution pipeline
  – Vector pipelines
• SIMD operations
  – Fine / medium grained
  – Single Instruction stream, Multiple Data stream
  – ALU arrays
• Overlapping of computation and communication
  – Fine / medium grained
  – Asynchronous
  – Prefetching
• Multithreading
  – Medium grained
  – Separate instruction streams serve a single processor
Sources of Performance Degradation (SLOW)
• Starvation
  – Not enough work to do among distributed resources
  – Insufficient parallelism
  – Inadequate load balancing
  – e.g.: Amdahl's law (a small worked example follows this list)
• Latency
  – Time required for response of access to remote data or services
  – Waiting for access to memory or other parts of the system
  – e.g.: local memory access, network communication
• Overhead
  – Extra work that has to be done to manage program concurrency and parallel resources, beyond the real work you want to perform
  – Critical-path work for management of concurrent tasks and parallel resources not required for sequential execution
  – e.g.: synchronization and scheduling
• Waiting for Contention
  – Delays due to conflicts for use of shared resources
  – e.g.: memory bank conflicts, shared network channels
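Amdahl's law, cited under Starvation above, bounds the speedup of a program whose parallel fraction is p when run on N processors: S(N) = 1 / ((1 - p) + p/N). A small worked example in C (p = 0.9 is an invented illustration):

#include <stdio.h>

int main(void)
{
    double p   = 0.9;                     /* illustrative parallel fraction */
    int    n[] = {1, 4, 16, 64, 1024};    /* processor counts to evaluate   */

    /* With p = 0.9, speedup can never exceed 1 / (1 - p) = 10, no matter
     * how many processors are added -- the serial fraction starves them.  */
    for (int i = 0; i < 5; i++) {
        double s = 1.0 / ((1.0 - p) + p / n[i]);
        printf("N = %4d  ->  speedup = %.2f\n", n[i], s);
    }
    return 0;
}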
Computer Architecture
• Structure
  – Functional elements
  – Organization and balance
  – Interconnect and data flow paths
• Semantics
  – Meaning of the logical constructs
  – Primitive data types
  – Manifest as the Instruction Set Architecture abstraction layer
• Mechanisms
  – Primitive functions that are usually implemented in hardware or sometimes firmware
  – Determine preferred actions and sequences
  – Enable efficiency and scalability
• Policy
  – Approach and priorities to accomplishing a goal
  – e.g., cache replacement policy
Structure
• Functional elements
  – The form of functional elements made up of more primitive logical modules
  – e.g. a vector arithmetic unit comprising a pipeline of simple stages
• Organization and balance
  – Number of major elements of different types
  – Hierarchy of collections of elements
• Data flow
  – Interconnection of functional, state, and communication elements
  – Control of dataflow paths determines actions of processor and system
Semantics
• Meaning of the logical constructs
  – Basic operations that can be performed on data
• Primitive data types
  – What a collection of bits (e.g. a word) means
  – Defines actions that can be performed on binary strings
• Instruction Set Architecture
  – Defined set of actions that can be performed and the data objects on which they can be applied
  – Encoding of binary strings to represent distinct instructions
• Parallel control constructs
  – Hardware implemented: vector operations
  – Software implemented: MPI libraries
Mechanisms
• Primitive functions that are usually implemented in hardware or sometimes firmware
  – Lower level than instruction set operations
  – Multiple such mechanisms contribute to execution of an operation
• Determine preferred actions and sequences
  – Usually time-effective primitives
  – Usually widely used by many instructions
• Enable efficiency and scalability
  – Establish basic performance properties of the machine
• Examples
  – Basic arithmetic and logic unit functions
  – Thread context switching
  – TLB (Translation Lookaside Buffer) address translation
  – Cache line replacement
  – Branch prediction
Policy
• Hardware architecture policies
  – Decision of ordering or allocation dependent on criteria
  – Not all machine decisions are visible to the ISA of the system
  – Not all machine choices are available to the name space of the operands
  – Examples
    • Cache structure, size, and speed
    • Cache replacement policies
    • Order of operation execution
    • Branch prediction
    • Allocation of shared resources
    • Network routers
• Software system management policies
  – Scheduling
  – Data allocation: partitioning of a problem
Parallel Structures & Performance Issues
• Pipelining
  – Vector processing
  – Execution pipeline
  – Performance issues:
    • Pipelining increases throughput: more operations per unit time
    • Pipelining increases latency: an operation on a single operand pair can take longer than in a non-pipelined functional unit
• Multiple Arithmetic Units
  – Instruction level parallelism
  – Systolic arrays
  – Performance issues:
    • Increases peak performance
    • Requires application instruction level parallelism
    • Average is usually significantly lower than peak
Parallel Structures & Performance Issues
• Multiple processors
  – MIMD: separate control
  – SIMD: single controller
  – Multicore
  – Accelerators
• Performance issues: multiple processors require overhead operations
  – Synchronization
  – Communications
  – Possibly cache coherence
Scalability• The ability to deliver proportionally greater sustained performance through
increased system resources• Strong Scaling
– Fixed size application problem– Application size remains constant with increase in system size
• Weak Scaling– Variable size application problem– Application size scales proportionally with system size
• Capability computing– In its purest form: strong scaling– Marketing claims tend toward this class
• Capacity computing– Throughput computing– Includes job-stream workloads
– In its simplest form: weak scaling
• Cooperative computing– Interacting and coordinating concurrent processes– Not a widely used term– Also: “coordinated computing”
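A brief illustration of the two scaling regimes (numbers are hypothetical, not from the lecture): under strong scaling, a fixed problem of 10^9 grid points run on 16 nodes instead of 1 leaves each node only 1/16 of the points, so work per node shrinks as the machine grows; under weak scaling, the same code run on 16 nodes with 10^9 points per node grows the total problem to 1.6 × 10^10 points while work per node stays constant.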
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
22
Performance Metrics
• Peak floating point operations per second (flops)• Peak instructions per second (ips)• Sustained throughput
– Average performance over a period of time– flops, Mflops, Gflops, Tflops, Pflops – flops, Megaflops, Gigaflops, Teraflops, Petaflops– ips, Mips, ops, Mops …
• Cycles per instruction– cpi – Alternatively: instructions per cycle, ipc
• Memory access latency– measured in cycles or nanoseconds
• Memory access bandwidth– bytes per second (Bps)– bits per second (bps)– or Gigabytes per second, GBps, GB/s
• Bi-section bandwidth– bytes per second
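A hypothetical worked example of these metrics (not from the slides): a node with 2 sockets, 4 cores per socket, 4 floating point operations per cycle per core, and a 2.5 GHz clock has a peak of 2 × 4 × 4 × 2.5×10^9 = 80 Gflops; if a benchmark sustains 12 Gflops on that node, the sustained-to-peak ratio is 15%.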
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
23
Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous
architectures• Summary – Material for the Test
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
Basic Uni-processor Architecture elements
• I/O Interface
• Memory Interface
• Cache hierarchy
• Register Sets
• Control
• Execution pipeline
• Arithmetic Logic Units
24
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
25
Multiprocessor• A general class of system• Integrates multiple processors into an interconnected ensemble• MIMD: Multiple Instruction Stream Multiple Data Stream• Different memory models
– Distributed memory• Nodes support separate address spaces
– Shared memory• Symmetric multiprocessor• UMA – uniform memory access• Cache coherent
– Distributed shared memory• NUMA – non uniform memory access• Cache coherent
– PGAS• Partitioned global address space• NUMA• Not cache coherent
– Hybrid : Ensemble of distributed shared memory nodes• Massively Parallel Processor, MPP
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
26
Massively Parallel Processor
• MPP• General class of large scale multiprocessor• Represents largest systems
– IBM BG/L– Cray XT3
• Distinguished by memory strategy– Distributed memory– Distributed shared memory
• Cache coherent• Partitioned global address space
• Custom interconnect network• Potentially heterogeneous
– May incorporate accelerator to boost peak performance
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
DM - MPP
27
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
28
IBM Blue Gene/L
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
Historical Top-500 List
29
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
30
BG/L packaging hierarchy
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
ASCI REDCompute Nodes 4,536
Service Nodes 32
Disk I/O Nodes 32
System Nodes (Boot) 2
Network Nodes (Ethernet, ATM) 10
System Footprint 1,600 Square Feet
Number of Cabinets 85
System RAM 594 Gbytes
Topology 38x32x2
Node to Node bandwidth - Bi-directional 800 Mbytes/sec
Bi-directional - Cross section Bandwidth 51.6 Gbytes/sec
Total number of Pentium® Pro Processors 9,216
Processor to Memory Bandwidth 533 Mbytes/sec
Compute Node Peak Performance 400 MFLOPS
System Peak Performance 1.8 TFLOPS
RAID I/O Bandwidth (per subsystem) 1.0 Gbytes/sec
RAID Storage (per subsystem) 1 Tbyte
31
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
ASCI RED : I/O Board
32
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
33
Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous
architectures• Summary – Material for the Test
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
34
Pipeline Structures• Partitioning of functional unit into a sequence of stages
– Execution time of each stage is < that of the original unit– Total time through sequence of stages is usually > that of the
original unit• Pipeline permits overlapping of multiple operations
– At any one time: each stage is performing different operation– # of operations being performed in parallel = # stages
• Performance– Pipeline increments at clock rate of slowest pipeline stage– Response time for an operation is product of # stages and clock
cycle time– Throughput = clock rate
• i.e. one operation result per clock cycle of pipeline
• Pipeline structures employed in many parts of a computer architecture – to enable high throughput in the presence of high latency – enable faster clock rates
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
35
Pipeline : Concepts
Perf_T = 1 / T_c
Perf_t = 1 / t_p
T_p = N_t × t_p
t_p < T_c < T_p

Where :
• T_c is the logic latency of the original, non-pipelined unit
• T_p is the aggregated pipeline latency
• t_p is the latency for each pipelined step
• N_t is the number of pipeline stages
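A hypothetical numeric illustration of these relations: if the unpipelined logic latency is T_c = 8 ns and the unit is split into N_t = 8 stages of t_p = 1.2 ns each (stage logic plus latch overhead), then T_p = N_t × t_p = 9.6 ns > T_c, so a single operation takes longer, while throughput rises from Perf_T = 1/(8 ns) = 125 Mops/s to Perf_t = 1/(1.2 ns) ≈ 833 Mops/s.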
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
36
Vector Processors• Supports fine grained data parallel semantics
– Many instances of same operation performed concurrently under same control element
– Operates on vector data structures rather than single scalar values– Vector-scalar operations
• Scale a vector by a scalar factor (multiply each vector element by scalar)
– Inter-vector operations• e.g., Pairwise multiplies
– Intra-vector operation• Reduction operators• e.g., sum all elements of a vector
• Exploits pipeline structure– Arithmetic units– Vector registers– Overlap of memory banks access cycles– Overlap of communication with computation
• Limited scaling – upper bound on number of pipeline stages
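A minimal C sketch of the three operation classes listed above, written as ordinary loops (a vector machine or vectorizing compiler would map each loop onto the vector pipeline; the array names and length are illustrative, not from the lecture):

#include <stdio.h>

#define N 8                       /* illustrative vector length (assumed) */

int main(void)
{
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i + 1.0; b[i] = 2.0; }

    /* Vector-scalar operation: scale each element of a by a scalar factor */
    double alpha = 0.5;
    for (int i = 0; i < N; i++) c[i] = alpha * a[i];

    /* Inter-vector operation: pairwise multiply of two vectors */
    for (int i = 0; i < N; i++) c[i] = a[i] * b[i];

    /* Intra-vector operation (reduction): sum all elements of a vector */
    double sum = 0.0;
    for (int i = 0; i < N; i++) sum += c[i];

    printf("sum = %g\n", sum);    /* 2*(1+2+...+8) = 72 */
    return 0;
}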
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
37
Vector Pipeline Architecture
[Figure: vector pipeline architecture — a vector register of length N_R feeding a vector ALU of N_S stages, with memory banks M1 … MN on a high-speed memory bus; t_a is the time for a memory access.]

T_MV = t_s + Σ t_a   (summed over the N memory accesses)
P_M = N / T_MV

Ideal vector performance = 1 / t_c
Achieved vector performance = N_R / ((N_S + N_R) × t_c)

In the special case N_S := N_R :
Perf_R = N_R / ((N_R + N_S) × t_c) = N_R / (2 × N_R × t_c) = 1 / (2 × t_c)

Where :
• t_a is the time for memory access
• t_s is the startup time
• T_MV is the combined time to move a vector between memory and the vector register
• P_M is the memory performance
• t_c is the ALU clock time of each step
• N_R is the vector register length and N_S the number of ALU pipeline stages
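A hypothetical illustration of the formulas above: with a vector register length N_R = 64, N_S = 8 pipeline stages, and t_c = 1 ns, the achieved vector performance is 64 / ((8 + 64) × 1 ns) ≈ 0.89 results per ns, against the ideal 1/t_c = 1 result per ns; in the special case N_S = N_R the ratio falls to exactly one half.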
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
38
Cray 1
[Photos: the Cray-1 system and Cray-1 logic boards]
• First announced in 1975-6 • 80 MHz Clock rate• Theoretical peak performance (160
MIPS), average performance 136 megaflops, vector optimized peak performance 150 megaflops
• 1-million 64 bit words of high speed memory
• Manufactured by Cray Research Inc.• First Customer was National Center for
Atmospheric Research (NCAR) for 8.86 million dollars.
src : http://en.wikipedia.org/wiki/Cray-1
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
39
Sidebar
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
40
Parallel-Vector-Processors: PVP
• Combines strengths of vector and MPP– Efficiency of vector processing
• Capability computing
– Scalability of massively parallel processing• Capacity and cooperative computing
• Two levels of parallelism– Ultra fine grain vector parallelism with vector pipelining– Medium to coarse-grain processor
• Memory model– Alternative ways of organizing memory & address space– Distributed memory
• Shared memory within node of multiple vector processors• Fragmented or decoupled address space between nodes
– Partitioned global address space• Globally accessible address space• No cache coherence between nodes
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
PVP (e.g. Cray X-MP)
41
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
42
Earth Simulator
src : http://www.es.jamstec.go.jp/esc/eng/
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
43
EarthSimulator (Facts)
• Located in Yokohama, Japan• Size of the entire center about 4 tennis courts• Can execute 35.86 trillion (35,860,000,000,000) FLOPS,
or 35.86 TFLOPS (LINPACK)• Consists of 640 nodes with each node consisting of 8
vector processors and 16 GB of memory• Totaling 5,120 processors and 10 Terabytes of memory• Aggregated disk storage of 700 Terabytes and around
1.6 Petabytes of storage in tape drives • Cost about 350 million dollars• First on the Top500 list for 5 consecutive lists.
Surpassed by IBM's BlueGene/L prototype on September 24, 2004
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011 44
PVP Examples
• Early machines– CRI XMP, YMP, C-90, T-90– Cray 2– Fujitsu VP5000
• SX-8
• Cray X1
Steve Scott
Cray Inc.
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
45
Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous
architectures• Summary – Material for the Test
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
46
SIMD Array• SIMD semantics
– Single Instruction stream Multiple Data stream– Data set partitioned into blocks upon which
• One or two dimensions (vectors or matrices)
– Each data block is processed separately– Each data block is controlled by same instruction sequence
– Data exchange cycle
• SIMD Parallel Structure– Node Array of arithmetic units, each coupled to local memory– Interconnect network for global data exchange– Single controller to issue instructions to array nodes
• Early systems broadcast one instruction at a time• Modern systems point to sequence of cached instructions
• SPMD– Single Program Multiple Data Stream– Microprocessor based system where each node runs same program
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
SIMD
47
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
48
Simplified SIMD Diagram
[Figure: a control processor / sequencer issues instructions over an instruction broadcast bus to an array of processing elements (data processors with associated memory blocks MD_ij), which exchange data through a switch network.]
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
49
CM-2
CM-2 General Specifications:• Processors: 65,536 • Memory: 512 Mbytes • Memory Bandwidth: 300 Gbits/sec • I/O Channels: 8 • Capacity per Channel: 40 Mbytes/sec max.
• Transfer Rate: 320 Mbytes/sec • Performance in excess of 2,500 MIPS • Floating Point performance in excess of 2.5 GFlops
DataVault Specifications: • Storage Capacity: 5 or 10 Gbytes • I/O Interfaces: 2 • Burst Transfer Rate: 40 Mbytes/sec max. • Aggregate Rate: 320 Mbytes/sec
• Originated at MIT, by Danny Hillis• Commercialized at Thinking Machines Corp. src : http://www.svisions.com/sv/cm-dv.html
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
50
ClearSpeed SIMD Accelerator
• 1997 Intel ASCI Red Supercomputer• 1TFLOPS, 2,500 sq.
ft., 800KW, $55Million
• 2007 ClearSpeed + Intel Dense Cluster• 1 TFLOPS, 25 sq. ft.,
<7 KW, <$200K
• Medium-Coarse grained SIMD• 130nm fabrication technology• 250 MHz clock rate• 100 Gflops peak, 66 Gflops sustained
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
Tsubame
• Heterogeneous computing : Added ClearSpeed Boards
• 648 nodes delivering 38.5 TFLOPS
• The same 648 nodes with 360 ClearSpeed boards added reach 47.38 TFLOPS
51
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
52
Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous
architectures• Summary – Material for the Test
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011 53
Special Purpose Devices • SPD• Optimized for a given algorithm or class of
problems• Functional elements and dataflow path mirror
the requirements of a specific algorithm• Usually exploits fine grain parallelism for very
high parallelism• Best for arithmetic (or logic) intensive
applications with limited memory access requirements
• Best for strong temporal and spatial locality• Systolic Arrays are one class of such
machines widely used in digital signal processing
• Examples– MD-Grape first Petaflops machine, for N-body
problem– GPU Graphics Processing Unit, e.g. NVIDIA– FPGA field programmable gate array
• Allows reconfiguration of logic array
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011 54
Systolic Arrays
[Figure: Warp architecture — a host connected through an interface unit to a linear array of processing cells (Cell 1 … Cell n), the Warp processor array.]

Example implementation: Warp architecture

Matrix multiplication on a systolic array computes c_ij = Σ_{k=1..n} a_ik × b_kj, with the A and B operands streamed through the processing elements while the C results accumulate in place.

References:
M. Annaratone, E. Arnould, et al., "The Warp Computer: Architecture, Implementation, and Performance"
Y. Yang, W. Zhao, and Y. Inoue, "High-Performance Systolic Arrays for Band Matrix Multiplication"
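A purely sequential C sketch of the computation the array performs, c_ij = Σ_k a_ik × b_kj (a reference implementation only; on a Warp-style systolic array the k loop is spread across cells, with A and B elements streamed between neighboring processing elements rather than indexed from memory):

#include <stdio.h>

#define N 3                       /* illustrative matrix order (assumed) */

/* Reference computation of c_ij = sum_k a_ik * b_kj.  On a systolic array,
   each processing element performs one multiply-accumulate per beat and
   passes its A and B operands on to its neighbors. */
static void matmul(const double a[N][N], const double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}

int main(void)
{
    double a[N][N] = {{1,0,0},{0,1,0},{0,0,1}};   /* identity matrix */
    double b[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double c[N][N];
    matmul(a, b, c);
    printf("c[1][2] = %g\n", c[1][2]);            /* expect 6 */
    return 0;
}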
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
55
Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous
architectures• Summary – Material for the Test
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
56
Introduction to SMP
• Symmetric Multiprocessor
• Building block for large MPP• Multiple processors
– 2 to 32 processors– Now Multicore
• Uniform Memory Access (UMA) shared memory– Every processor has equal access in equal time to all banks of
the main memory
• Cache coherent– Multiple copies of variable maintained consistent by hardware
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
SMP - UMA
57
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
58
SMP Node Diagram
[Figure: SMP node — microprocessors (MP), each with private L1/L2 caches, sharing L3 caches; a memory controller linking memory banks M1 … Mn-1; storage (S); network interface cards (NIC); plus USB peripherals, JTAG, Ethernet, and PCI-e interfaces.]
Legend : MP : MicroProcessor; L1, L2, L3 : Caches; M1… : Memory Banks; S : Storage; NIC : Network Interface Card
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
DSM - NUMA
59
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
Challenges to Computer Architecture• Expose and exploit extreme fine-grain parallelism
– Possibly multi-billion-way (for Exascale)– Data structure-driven (use meta-data parallelism)
• State storage takes up much more space than logic– 1:1 flops/byte ratio infeasible– Memory access bandwidth is the critical resource
• Latency – can approach a million cycles (10,000 or more cycles, typical)– All actions are local– Contention due to inadequate bandwidth
• Overhead for fine grain parallelism must be very small – or system cannot scale– One consequence is that global barrier synchronization is untenable
• Power consumption• Reliability
– Very high replication of elements– Uncertain fault distribution– Fault tolerance essential for good yield
• Design complexity– Impacts development time, testing, power, and reliability
60
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
61
Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous
architectures• Summary – Material for the Test
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
Multi-Core
• Motivation for Multi-Core– Exploits increased feature-size and density– Increases functional units per chip (spatial efficiency)– Limits energy consumption per operation– Constrains growth in processor complexity
• Challenges resulting from multi-core– Relies on effective exploitation of multiple-thread parallelism
• Need for parallel computing model and parallel programming model– Aggravates memory wall
• Memory bandwidth– Way to get data out of memory banks– Way to get data into multi-core processor array
• Memory latency• Fragments L3 cache
– Pins become strangle point• Rate of pin growth projected to slow and flatten• Rate of bandwidth per pin (pair) projected to grow slowly
– Requires mechanisms for efficient inter-processor coordination• Synchronization• Mutual exclusion• Context switching
62
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
IBM Blue Gene/L
63
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
Intel Core i7
64
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
AMD Quad Core Architecture
65
AMD quad-core x86 Opteron processor layout
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
66
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
IBM/SONY Cell Architecture
• Product of the “STI” alliance: SCEI (Sony), Toshiba and IBM
• Budget estimate ~$400 mil• Primary design center in Austin, TX
(March 2001)• Modified POWER4 toolchain
• The effort took 4 years, with over 400 engineers and 11 IBM centers involved
• Original target applications:
– Sony Playstation 3– IBM blade server– Toshiba HDTV
67
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
Cell Processor in Numbers
• 234 mil transistors• 221mm2 die on 90nm process• SOI, low-k dielectrics, copper interconnects• 3.2GHz clock speed (over 5Ghz in lab)• Peak performance:
– over 256Gflops @4GHz, single precision– ~26Gflops, double precision– memory bandwidth: 25.6Gbytes/s– I/O bandwidth: 76.8Gbytes/s (44.8 outbound, 32
inbound)• Power consumption undisclosed, estimated at 30W
(MacWorld) or 50-80W (other sources); 5 power states
68
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
Internal Structure
69
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
Cell Components and Layout
• One Power Processing Element (PPE)
• Multiple Synergistic Processing Elements (SPE)
• Element Interconnect Bus (EIB)
• Dual channel XDR memory controller
• FlexIO external I/O interface
70
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
Conventional Strategies to Address the Multi-Core Challenge
• Maintain status quo– Investment in current code stack– Investment in core design
• Increase L2/L3 cache size– Attempt to exploit existing temporal locality
• Increase chip I/O bandwidth– Reduce contention– Eventually embedded optical interfaces chip-to-chip
• Memory bandwidth aggregation through “weaver” chip– Balances processor data demand with memory supply rate– Enables and coordinates multiple overlapping memory banks
• Exploit job stream parallelism– Independent jobs
• O/S scheduling
– Concurrent parametric processes• Multiple instances of same job across parametric set• e.g., Condor
– Coarse grain communicating sequential processes• Message passing; e.g., MPI• Barrier synchronization
71
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
Limitations of Conventional Incremental Approaches to MultiCore
• It's not just SMP on a chip– Cores on wrong side of the pins– Users expect to see performance gain on existing applications
• Highly sensitive to temporal locality– Fragile in the presence of memory latency– Uses up majority of chip area on caching
• Emphasizes ALU as precious resource– ALU low spatial cost – Memory bandwidth is pacing element for data intensive problems
• Low effective energy usage– Suffers from core complexity
• Does not address intrinsic problems of low efficiency– Just hoping to stay even with Moore’s Law– Single digit sustained/peak performance– Bad when ALU is critical path element
• The Memory Wall is getting Worse!
72
[Figure: the memory wall — CPU clock period (ns) and memory system access time, 1997–2009, plotted on a log scale, together with the memory-to-CPU ratio, which climbs steadily toward several hundred.]
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
Commodity Clusters
• Distributed Memory systems
• Superior performance to cost
• Dominant parallel systems architecture on the Top 500 List
• Combines off the shelf systems in scalable structure
• Employs commercial high-bandwidth networks for integration
• Message Passing programming model used (e.g. MPI)
• First cluster on Top500 : Berkeley NOW, 1997
73
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
74
Topics• Introduction• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance Issues• Performance Metrics• Coarse-grained MIMD Processing – MPPs • Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors• Current generation multicore and heterogeneous
architectures• Summary – Material for the Test
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
75
Summary – Material for the Test
• HPC System Stack – slide 5• Performance factors : Technology speed – slide 7• Performance factors : Parallelism –slide 9• Sources of Performance Degradation – slide 10• Computer architecture – slides 12-16• Parallel Structures – slide 18• Performance issues of parallel structures – slide 19• Scalability – slide 21• Performance Metrics – slide 22• Basic uni-processor architecture elements – slide 24• Multiprocessor architecture slides – slides 25 • MPP systems – slides 26,27• Pipeline structures – slides 34,35• Vector processors – slides 36,37• Parallel vector processors (PVP) – slides 40, 41• SIMD – slides 46, 47• Challenges to computer architecture – slides 60
CSC 7600 Lecture 2 : Parallel Computer Architecture, Spring 2011
76
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &
MEANS
COMMODITY CLUSTERS
Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
January 25, 2011
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Topics
• Introduction to Commodity Clusters
• A brief history of Cluster computing
• Dominance of Clusters
• Core systems elements of Clusters
• SMP Nodes
• Operating Systems
• DEMO 1 : Arete Cluster Environment
• Throughput Computing
• Networks
• Resource Management / Scheduling Systems
• Message-passing/Cooperative programming model
• Cluster programming/application runtime environment
• Performance measurement & profiling of applications
• Summary Materials for Test
2
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Topics
• Introduction to Commodity Clusters
• A brief history of Cluster computing
• Dominance of Clusters
• Core systems elements of Clusters
• SMP Nodes
• Operating Systems
• DEMO 1 : Arete Cluster Environment
• Throughput Computing
• Networks
• Resource Management / Scheduling Systems
• Message-passing/Cooperative programming model
• Cluster programming/application runtime environment
• Performance measurement & profiling of applications
• Summary Materials for Test
3
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
4
What is a Commodity Cluster
• It is a distributed/parallel computing system
• It is constructed entirely from commodity subsystems
– All subcomponents can be acquired commercially and separately
– Computing elements (nodes) are employed as fully operational
standalone mainstream systems
• Two major subsystems:
– Compute nodes
– System area network (SAN)
• Employs industry standard interfaces for integration
• Uses industry standard software for majority of services
• Incorporates additional middleware for interoperability among
elements
• Uses software for coordinated programming of elements in parallel
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
6
Earth Simulator and
TSUBAME
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
7
Red Sky
• One of the largest clusters in the
world (located in Sandia National
Laboratories, USA)
• Sun Blade x6275 system family
• 41616 Cores
• Intel EM64T Xeon X55xx (Nehalem-
EP) 2930 MHz (11.72 GFlops)
• 22104 GB main memory
• Number 10 on TOP500
• Infiniband interconnection
• Peak performance:
487 Tflops
• R_max:
423 Tflops
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
8
Commodity Clusters vs “Constellations”
[Figure: a 64-processor constellation (4 nodes of 16 processors each, 16X) versus a 64-processor commodity cluster (16 nodes of 4 processors each, 4X), each joined by a system area network.]
• An ensemble of N nodes each comprising p computing elements
• The p elements are tightly bound shared memory (e.g., smp, dsm)
• The N nodes are loosely coupled, i.e., distributed memory
• p is greater than N
• Distinction is which layer gives us the most power through parallelism
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
9
Columbia
• NASA’s largest computer
• NASA Ames Research Center
• A Constellation
– 20 nodes
– SGI Altix 512 processor nodes
– Total: 10,240 Intel Itanium-2
processors
• 400 Terabytes of RAID
• 2.5 Petabytes of silo farm tape
storage
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Topics
• Introduction to Commodity Clusters
• A brief history of Cluster computing
• Dominance of Clusters
• Core systems elements of Clusters
• SMP Nodes
• Operating Systems
• DEMO 1 : Arete Cluster Environment
• Throughput Computing
• Networks
• Resource Management / Scheduling Systems
• Message-passing/Cooperative programming model
• Cluster programming/application runtime environment
• Performance measurement & profiling of applications
• Summary Materials for Test
10
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
11
A Brief History of Clusters
• 1957 – SAGE by IBM & MIT-LL for Air Force NORAD
• 1976 -- Ethernet
• 1984 – Cluster of 160 Apollo workstations by NSA
• 1985 – M31 Andromeda by DEC, 32 VAX 11/750
• 1986 – Production Condor cluster operational
• 1990 – PVM released
• 1993 – First NOW workstation cluster at UC Berkeley
• 1993 – Myrinet introduced
• 1994 – First Beowulf PC cluster at NASA Goddard
• 1994 – MPI standard
• 1996 – >1Gflops
• 1997 – Gordon Bell Prize for Price-Performance
• 1997 – Berkeley NOW first cluster on Top-500
• 1997 -- >10 Gflops
• 1998 – Avalon by LANL on Top500 list
• 1999 -- >100 Gflops
• 2000 – Compaq and PSC awarded 5 Tflops by NSF
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
12
UC-Berkeley NOW Project
• NOW-1 1995
• 32-40 SparcStation 10s and
20s
• originally ATM
• first large myrinet network
NOW-2 1997
100+ Ultra Sparc 170s
128 MB memory, two 2 GB disks, Ethernet, Myrinet
largest Myrinet configuration in the world
First cluster on the TOP500 list
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
13
NOW Accomplishments
• Early prototypes in 1993 & 1994
• First Inktomi
• Complete Glunix + virtual network environment– able to page many processes onto dedicated
user-level network resources
• NPACI production resource since 1998
• Active Messages demonstrates user level communication in full Unix environment
• First cluster on the TOP500 list
• Set all Parallel Disk-disk sort records (2 yrs)– 500 MB/s disk bandwidth
– 1,000 MB/s network bandwidth
• Basis for studies in novel OS structures
[Figure: Minute Sort — gigabytes sorted versus number of processors (up to 100), with NOW outperforming the SGI Power Challenge and SGI Origin.]
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
14
NASA Beowulf Project
Wiglaf - 1994
16 Intel 80486 100 MHz
VESA Local bus
256 Mbytes memory
6.4 Gbytes of disk
Dual 10 base-T Ethernet
72 Mflops sustained
$40K
Hrothgar - 1995
16 Intel Pentium 100 MHz
PCI
1 Gbyte memory
6.4 Gbytes of disk
100 base-T Fast Ethernet
(hub)
240 Mflops sustained
$46K
Hyglac-1996 (Caltech)
16 Pentium Pro 200 MHz
PCI
2 Gbytes memory
49.6 Gbytes of disk
100 base-T Fast Ethernet
(switch)
1.25 Gflops sustained
$50K
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
15
Beowulf Accomplishments
• An experiment in parallel computing systems
• Established the vision of low-cost HPC
• Demonstrated effectiveness of PC clusters for some classes of applications
• Provided networking software in Linux
• Mass Storage with PVFS
• Provided cluster management tools
• Achieved >10 Gflops performance
• Gordon Bell Prize for Price-Performance
• Conveyed findings to broad community
• Tutorials and the book
• Provided design standard to rally community
• Spin-off of Scyld Computing Corp.
Hive at GSFC
Naegling at Caltech CACR
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
16
“Do it Yourself Supercomputers”
• Synthesis of just-ready hardware/software elements
• Narrow window of opportunity
• PCs just capable of a few Mflops
• Ethernet LAN (10 base-T) just cheap enough
• A cost constrained requirement with funding
• An open source Unix, albeit immature
• Experience with clustering
• A stable message passing library
• Talent availability to fill the gaps
• Willingness to win or fail
• Modest and well defined goals, vision, and path
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
17
Dominance of Clusters in HPC
• Every major HPC vendor (but 1) has a
cluster product
– IBM
– HP
– SUN
– NEC
– Fujitsu
– SGI
– Cray
• Additional vendors dedicated to clusters
– Penguin
– Dell
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Topics
• Introduction to Commodity Clusters
• A brief history of Cluster computing
• Dominance of Clusters
• Core systems elements of Clusters
• SMP Nodes
• Operating Systems
• DEMO 1 : Arete Cluster Environment
• Throughput Computing
• Networks
• Resource Management / Scheduling Systems
• Message-passing/Cooperative programming model
• Cluster programming/application runtime environment
• Performance measurement & profiling of applications
• Summary Materials for Test
18
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
19
Clusters Dominate Top-500
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
20
Why are Clusters so Prevalent
• Excellent performance to cost for many workloads– Exploits economy of scale
• Mass produced device types
• Mainstream standalone subsystems
– Many competing vendors for similar products
• Just in place configuration– Scalable up and down
– Flexible in configuration
• Rapid tracking of technology advance– First to exploit newest component types
• Programmable– Uses industry standard programming languages and tools
• User empowerment• Low cost, ubiquitous systems
• Programming systems make it relatively easy to program for expert users
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
21
1st printing: May, 1999
2nd printing: Aug. 1999
MIT Press
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Topics
• Introduction to Commodity Clusters
• A brief history of Cluster computing
• Dominance of Clusters
• Core systems elements of Clusters
• SMP Nodes
• Operating Systems
• DEMO 1 : Arete Cluster Environment
• Throughput Computing
• Networks
• Resource Management / Scheduling Systems
• Message-passing/Cooperative programming model
• Cluster programming/application runtime environment
• Performance measurement & profiling of applications
• Summary Materials for Test
22
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
23
What You Need to Know about Clusters
• Key system elements
– SMP Node
– Interconnect Networks
– Operating Systems
– Resource Management / Scheduling systems
• Programming & Runtime environment
– Message-passing/Cooperative programming model
– Programming languages & compilers, debuggers
• Performance Measurement & Profiling
– How is performance affected
– How to measure how well the applications behave
– How to optimize application behavior
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
24
Key Parameters for Cluster Computing
• Peak floating point performance
• Sustained floating point performance
• Main memory capacity
• Bi-section bandwidth
• I/O bandwidth
• Secondary storage capacity
• Organization– Processor architecture
– # processors per node
– # nodes
– Accelerators
– Network topology
• Logistical Issues– Power Consumption
– HVAC / Cooling
– Floor Space (Sq. Ft)
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
25
Where’s the Parallelism
• Inter-node
– Multiple nodes
– Primary level for commodity clusters
– Secondary level for constellations
• Multi socket, intra-node
– Routinely 1, 2, 4, 8
– Heterogeneous computing with accelerators
• Multi-core, intra-socket
– 2, 4 cores per socket
• Multi-thread, intra-core
– None or two usually
• ILP, intra-core
– Multiple operations issued per instruction
• Out of order, reservation stations
• Prefetching
• Accelerators
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
26
Cluster System
[Figure: cluster system — multiple compute nodes, each with microprocessors (MP) and L1/L2/L3 caches, memory banks M1 … Mn-1, a memory controller, storage (S), and NICs, all joined by an interconnect network; a resource management & scheduling subsystem and a login / cluster access node front the system.]
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
27
Constituent Hardware Elements
• Compute Nodes (“nodes”)
– Standalone mainstream products
– Processors and accelerators
– Memory and caches
– Chip set
– Interfaces
• System Area Network(s)
– Network interface controllers (NIC)
– Switches
– Cables
• External I/O
– File system
– Internet access
– User interface
– Management and administration
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Topics
• Introduction to Commodity Clusters
• A brief history of Cluster computing
• Dominance of Clusters
• Core systems elements of Clusters
• SMP Nodes
• Operating Systems
• DEMO 1 : Arete Cluster Environment
• Throughput Computing
• Networks
• Resource Management / Scheduling Systems
• Message-passing/Cooperative programming model
• Cluster programming/application runtime environment
• Performance measurement & profiling of applications
• Summary Materials for Test
28
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
29
Microprocessor Clock Rate
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
31
Compute Node Diagram
[Figure: compute node — microprocessors (MP) with private L1/L2 caches sharing L3; memory banks M0 … Mn-1 behind a memory controller; storage (S); NICs; USB peripherals, JTAG, Ethernet, and PCI-e interfaces.]
Legend : MP : MicroProcessor; L1, L2, L3 : Caches; M1… : Memory Banks; S : Storage; NIC : Network Interface Card
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
33
Parameters for Cluster Nodes
• Processor architecture family (AMD Opteron, Intel Xeon, IBM Power)• Number of processor chips (2)• Number of processor cores per chip (multicore) (3-4)• Memory capacity per processor chip (2 GBytes per core)• Processor core clock rate (3)
– GHz
• Operations per instruction issue, ILP (2 – 4 floating point operations)• Cache size per core (L1, L2, L3)• Distributed or shared memory (SMP) structure
– Cache coherent?
• Number and class of network ports• Latency to main memory (100 – 400 cycles)
– Measured in processor clock cycles
• Disk spindles and capacity (0, 1, or 2)• Ancillary I/O ports• Packaging issues
– Power– Size (1 to 4 u) (http://en.wikipedia.org/wiki/Rack_unit)– Cost
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Topics
• Introduction to Commodity Clusters
• A brief history of Cluster computing
• Dominance of Clusters
• Core systems elements of Clusters
• SMP Nodes
• Operating Systems
• DEMO 1 : Arete Cluster Environment
• Throughput Computing
• Networks
• Resource Management / Scheduling Systems
• Message-passing/Cooperative programming model
• Cluster programming/application runtime environment
• Performance measurement & profiling of applications
• Summary Materials for Test
34
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
35
The History of Linux
• Started out with Linus's frustration with the affordable operating
systems available for the PC
• He put together a rudimentary scheduler, and later added more
features until he could bootstrap the kernel (1991).
• The source was released on the internet in the hope that more people
would contribute to the kernel
• GCC was ported, and a C library was added along with primitive serial and
tty driver code
• Networking and file systems were added
• Slackware
• RedHat
• Extreme Linux
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
36
Open Source Software
• Evolution of PC Clusters has benefited from Open Source Software
• Early examples
– Gnu compiler tools, FreeBSD, Linux, PVM
• Advantages
– Provides shared infrastructure – avoids duplication of effort
– Permits wide collaborations
– Facilitates exploratory studies and innovation
• Free software is not necessarily OSS
• Business model in state of flux: how to fund free deliverables
• Important synergy between OSS standard infrastructure software and
proprietary ISV target-specific software:
– OSS provides common framework
– For-profit software provides incentive and resources
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011 37
Linux DistributionsAlphanet Linux
Alzza Linux
Andrew Linux
Apokalypse
Armed Linux
ASPLinux
Bad Penguin
Bastille Linux
Best Linux
BlackCat Linux
Blue Linux
Bluecat Linux
BluePoint Linux
Brutalware
Caldera OpenLinux
Cclinux
ChainSaw Linux
CLEClIeNUX
Conectiva
CoolLinux
Coyote Linux
Corel
COX-Linux
Darkstar Linux
Debian Definite
Linux
deepLINUX
Delix
Dlite (Debian Lite)
DragonLinux
Eagle Linux M68K
easyLinux
Elfstone Linux
Embedix
Enoch
Eonova Linux
ESware
Etlinux
Eurielec Linux
FinnixFloppi Gentoo
Linux
Gentus Linux
Green Frog Linux
Halloween Linux
Hard Hat Linux
HispaFuentes
HVLinux
Icepack
Immunix
OSIndependence
InfoMagick Workgroup
Server
Ivrix
ix86 Linux
JBLinux
Jurix Linux
Kondara
Krud
KW Linux
KSI Linux
L13Plus
Laser5
Leetnux
Lightening
Linpus Linux
Linux Antarctica
Linux by Linux
Linux GT Server Edition
Linux Mandrake
Linux MX
LinuxOne
LinuxPPC
LinuxPPP
LinuxSIS
LinuxWare
Linux-YeS
LNX System
Lunet
LuteLinux
LST
Mastodon
MaxOS™
MIZI Linux OS
MkLinux
MNIS Linux
MicroLinux
Monkey Linux
NeoLinux
Newlix OfficeServer
NoMad Linux
Ocularis
Open Kernel Linux
Open Share Linux
OS2000
Peanut Linux
PhatLINUX
PingOO
Plamo Linux
Platinum Linux
Power Linux
Progeny Debian
Project Freesco
Prosa Debian
Pygmy Linux
Red Flag Linux
Red Hat Linux
Redmond Linux
Rock Linux
RT-Linux
Scrudge Ware
Secure Linux
Skygate Linux
Slacknet Linux
Slackware
Slinux
SOT Linux
Spiro
Stampede Linux
Storm Linux
S.u.SE
Thin Linux
TINY Linux
Trinux
Trustix Secure Linux
TurboLinux
Turquaz
UltraPenguin
Ute-Linux
VA-enhanced RedHat Linux
VectorLinux
Vedova Linux
Vine Linux
White Dwarf Linux
Whole Linux
WinLinux 2000
WorkGroup Solutions
Linux Pro Plus
Xdenu
Xpresso Linux 2000
XTeam Linux
Yellow Dog Linux
Yggdrasil Linux
ZiiF Linux
ZipHam
ZipSlack
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
38
Operating System
• What is an Operating System?
– A program that controls the execution of application programs
– An interface between applications and hardware
• Primary functionality
– Exploits the hardware resources of one or more processors
– Provides a set of services to system users
– Manages secondary memory and I/O devices
• Objectives
– Convenience: Makes the computer more convenient to use
– Efficiency: Allows computer system resources to be used in an
efficient manner
– Ability to evolve: Permit effective development, testing, and
introduction of new system functions without interfering with service
Source: William Stallings “Operating Systems: Internals and Design Principles (5th Edition)”
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
39
Services Provided by the OS
• Program development
– Editors and debuggers
• Program execution
• Access to I/O devices
• Controlled access to files
• System access
• Protection
• Error detection and response
– Internal and external hardware errors
– Software errors
– Operating system cannot grant request of application
• Accounting
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
40
Layers of Computer System
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
41
Resources Managed by the OS
• Processor
• Main Memory
– volatile
– referred to as real memory or primary memory
• I/O modules
– secondary memory devices
– communications equipment
– terminals
• System bus
– communication among processors, memory, and I/O modules
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Topics
• Introduction to Commodity Clusters
• A brief history of Cluster computing
• Dominance of Clusters
• Core systems elements of Clusters
• SMP Nodes
• Operating Systems
• DEMO 1 : Arete Cluster Environment
• Throughput Computing
• Networks
• Resource Management / Scheduling Systems
• Message-passing/Cooperative programming model
• Cluster programming/application runtime environment
• Performance measurement & profiling of applications
• Summary Materials for Test
42
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Topics
• Introduction to Commodity Clusters
• A brief history of Cluster computing
• Dominance of Clusters
• Core systems elements of Clusters
• SMP Nodes
• Operating Systems
• DEMO 1 : Arete Cluster Environment
• Throughput Computing
• Networks
• Resource Management / Scheduling Systems
• Message-passing/Cooperative programming model
• Cluster programming/application runtime environment
• Performance measurement & profiling of applications
• Summary Materials for Test
43
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
44
Programming on Clusters
• Several ways of programming application on clusters− Throughput – jobstream
− Decoupled Work Queue Model – SPMD for parameter studies
− Communicating Sequential Processes (CSP)
− Multi threaded
• Throughput: job stream– PBS, Maui
• Decoupled Work Queue Model : SPMD, e.g. parametric studies– Condor
• Communicating Sequential Processes– Message passing
– Distributed memory
– Global barrier synchronization
– e.g., MPI
• Multi threaded– Limited to intra-node programming
– Shared memory
– e.g., OpenMP
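A minimal C/OpenMP sketch of the intra-node, shared-memory multi-threaded model listed above (illustrative only; the array, its size, and the compile flag — e.g. gcc -fopenmp — are assumptions, not lecture material):

#include <omp.h>
#include <stdio.h>

#define N 1000000              /* illustrative array length (assumed) */

int main(void)
{
    static double x[N];        /* shared by all threads */
    double sum = 0.0;

    /* The iteration space is split across the threads of one SMP node;
       the reduction clause combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        x[i] = 1.0 / (i + 1.0);
        sum += x[i];
    }

    printf("max threads = %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}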
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Throughput Computing
• Simplest form of parallel computing
• Separate jobs on separate compute nodes
– Independent tasks on independent nodes
• No intra application / cross node communication
• “job stream” workflow
• Capacity computing
– Distinguished from cooperative and capability computing
– Scaling dependent on number of concurrent jobs
• Performance
– Throughput
– Total aggregate operations per second achieved
• Widely used for servers
45
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Decoupled Work Queue Model
• Concurrent disjoint tasks
• Parametric Studies
– SPMD (single program multiple data)
• Very coarse grained
• Example software package : Condor
• Processor farms and clusters
• Throughput Computing Lecture covers this model of
parallelism in greater depth
46
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Topics
• Introduction to Commodity Clusters
• A brief history of Cluster computing
• Dominance of Clusters
• Core systems elements of Clusters
• SMP Nodes
• Operating Systems
• DEMO 1 : Arete Cluster Environment
• Throughput Computing
• Networks
• Resource Management / Scheduling Systems
• Message-passing/Cooperative programming model
• Cluster programming/application runtime environment
• Performance measurement & profiling of applications
• Summary Materials for Test
47
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011 48
Some Node Interconnect Options
• Current Generation
– Gigabit Ethernet (~1000 Mb/s)
– 10 Gigabit Ethernet
– 40 Gigabit Ethernet and 100 Gigabit Ethernet (100GbE)
standards are in draft as of 2009
– Infiniband (IBA)
• Previous Generation
– Fast Ethernet (~100 Mb/s)
– Myricom’s Myrinet-2000 (~1600 Mb/s)
– SCI (~4000 Mb/s)
– OC-12 ATM (~622 Mb/s)
– Fiber Channel (~100 MB/s)
– USB (12 Mb/s)
– Firewire (IEEE 1394 400 Mb/s)
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
49
Fast and Gigabit Ethernet
• Cost effective
• Lucent, 3com, Cisco, etc.
• Directly leverage LAN technology and market
• Up to 384 100 Mbps ports in one switch
• Switches can be stacked on connected with multiple gigabit links
• 100 Base-T:– Bandwidth: > 11 MB/s
– Latency: < 90 microseconds
• 1000 Base-T:– Bandwidth: ~ 50 MB/s
– Latency: < 90 microseconds
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
50
Myrinet
• High Performance: 2+2 Gbps
• Low latency: 11 microseconds
• Fiber and copper interconnects
• High Availability – auto reroute
• 4, 8,16 and 64 port switches, stackable
• Scalable to 1000s of hosts
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
InfiniBand
51
• High Performance: 10 - 20 Gbps
• Low latency: 1.2 microseconds
• Copper interconnects
• High availability - IEEE 802.3ad Link Aggregation / Channel Bonding
http://www.hpcwire.com/hpc/1342206.html
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Network Interconnect Topologies
52
TORUS
FAT-TREE (CLOS)
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
53
Dell PowerEdge SC1435
Opteron, IBA
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
54
Example: 320-host Clos topology of
16-port switches
[Figure: five groups of 64 hosts connected through a Clos network built from 16-port switches.]
(From Myricom)
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Arete Infiniband Network
55
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Topics
• Introduction to Commodity Clusters
• A brief history of Cluster computing
• Dominance of Clusters
• Core systems elements of Clusters
• SMP Nodes
• Operating Systems
• DEMO 1 : Arete Cluster Environment
• Throughput Computing
• Networks
• Resource Management / Scheduling Systems
• Message-passing/Cooperative programming model
• Cluster programming/application runtime environment
• Performance measurement & profiling of applications
• Summary Materials for Test
56
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
57
Schedulers : PBS
Workload management system – coordinates resource utilization policy and user job requirements– Multi-user, multi-job, multi-node
• Both Open Source and Commercially supported (Veridian)
• Functionality– Manages parallel job execution
– Interactive and batch cross system scheduling
– Security and access control lists
– Dynamic distribution and automatic load-leveling of workload
– Job and user accounting
• Accomplishments– Runs on all Unix and Linux platforms
– Supports MPI
– First release 1995
– 2000 sites registered, 1000 people on the mailing list
– PBSPro sales at >5,000 CPUs
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
58
Schedulers : Maui (Moab)
• Cluster Resources Inc.
• Advanced systems software tool for more optimal job
scheduling
• Improved administration and statistical reporting
capabilities
• Analytical simulation capabilities to evaluate different
allocation and prioritization schemes.
• Offers different classes of services to users, allowing
high priority users to be scheduled first, while
preventing long-term starvation of low priority jobs.
• SMP Enabled
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
59
Schedulers : Condor
• Distributed Task Scheduler
• Emphasis on throughput or capacity computing
• Services
– Automates cycle harvesting and workstation farms
– Distributed time-sharing and batch processing resource
– Exploits opportunistic versus dedicated resources
– Permits preemptive acquisition of resources
– Transparent checkpointing
– Remote I/O – preserves local execution environment (requires relinking)
– Asynchronous process management, master-worker processing
• Accomplishments
– First production system operational in 1986
– U. of Wisconsin 1300 CPU’s Condor controlled on campus
– Used by:
• large software house for bills and testing,
• Xerox printer simulation,
• Core Digital Pictures rendering of movies,
• INFN for high energy physics,
• 250 machines at NAS, half million hours
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Topics
• Introduction to Commodity Clusters
• A brief history of Cluster computing
• Dominance of Clusters
• Core systems elements of Clusters
• SMP Nodes
• Operating Systems
• DEMO 1 : Arete Cluster Environment
• Throughput Computing
• Networks
• Resource Management / Scheduling Systems
• Message-passing/Cooperative programming model
• Cluster programming/application runtime environment
• Performance measurement & profiling of applications
• Summary Materials for Test
60
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
61
MPI Software
• Community wide standard process
– Leveraged experiences with NX, PVM, P4, Zipcode, others
• Dominant programming model for clusters
• Multiple implementations both OSS and commercial (MPI Soft Tech)
– All of MPI-1
– MPI I/O
– All of MPI-2
– MPI-3 under development
• Functionality
– Message passing model for distributed memory platforms
– Support for truly scalable operations (1000s nodes)
• Rich set of collective operations (gathers, reduces, scans, all to all)
• Scalable one sided operations (fence barrier synchronization, group-oriented synchronization)
– Dynamic processes (MPI-2) to spawn, disconnect, etc. with scalability
• MPICH-2 entirely new rewrite
• OpenMPI includes fault tolerant capability
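A minimal MPI example in C showing the message-passing model and one of the collective operations mentioned above (a sum reduction onto rank 0); it is a sketch for illustration, not code from the lecture:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes one value; MPI_Reduce sums them onto rank 0 */
    double local = (double)(rank + 1);
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %f\n", size, total);

    MPI_Finalize();
    return 0;
}

Built with an MPI wrapper compiler (e.g. mpicc) and launched with mpirun, every rank contributes one value and rank 0 prints the total.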
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Topics
• Introduction to Commodity Clusters
• A brief history of Cluster computing
• Dominance of Clusters
• Core systems elements of Clusters
• SMP Nodes
• Operating Systems
• DEMO 1 : Arete Cluster Environment
• Throughput Computing
• Networks
• Resource Management / Scheduling Systems
• Message-passing/Cooperative programming model
• Cluster programming/application runtime environment
• Performance measurement & profiling of applications
• Summary Materials for Test
62
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Compilers & Debuggers
• Compilers : – Intel C/ C++ / Fortran
– PGI C/ C++ / Fortran
– GNU C / C++ / Fortran
• Libraries :– Each compiler is linked against MPICH
– Mesh/Grid Partitioning software : METIS etc.
– Math Kernel Libraries (MKL)
– Intel MKL, AMD MKL, GNU Scientific Library (GSL)
– Data format libraries : NetCDF, HDF 5 etc
– Linear Algebra Packages : BLAS, LAPACK etc
• Debuggers– gdb
– Totalview
• Performance & Profiling tools : – PAPI
– TAU
– Gprof
– perfctr
63
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Distributed File Systems
• A distributed file system is a file system that is stored locally on one system (server) but is accessible by processes on many systems (clients).
• Multiple processes access multiple files simultaneously.
• Other attributes of a DFS may include :
– Access control lists (ACLs)
– Client-side file replication
– Server- and client- side caching
• Some examples of DFSes:
– NFS (Sun)
– AFS (CMU)
– PVFS (Clemson, Argonne), OrangeFS
– Lustre (Sun)
– GPFS (IBM)
• Distributed file systems can be used by parallel programs, but they have significant disadvantages :
– The network bandwidth of the server system is a limiting factor on performance
– To retain UNIX-style file consistency, the DFS software must implement some form of locking which has significant performance implications
64
Ohio Supercomputer Center
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Distributed File System : NFS
• Popular means for accessing remote file
systems in a local area network.
• Based on the client-server model , the remote
file systems are “mounted” via NFS and
accessed through the Linux virtual file system
(VFS) layer.
• NFS clients cache file data, periodically
checking with the original file for any changes.
• The loosely-synchronous model makes for
convenient, low-latency access to shared
spaces.
• NFS avoids the common locking systems used
to implement POSIX semantics.
65
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
66
Parallel Virtual File System (PVFS)
• Clemson University - 1993
• Objective: high throughput file system – DOE, NASA, (GPL)
• Strategy:
– exploit parallelism of bandwidth
– provide user interface so that applications can make powerful requests such as large collection of non-contiguous data with single request for multidimensional data sets,
– allow application direct access to server:
• multiple application tasks directly access/spawn multiple file servers without going through kernel or central mechanism.
• N-clients and N-servers
• Single file spread across multiple disks and nodes and accessed by multiple tasks in an application.
• Scaling facilitated by eliminating single bottleneck
• Actual distribution of a file is configurable on a file by file basis.
• Reactive scheduling addresses the problem of network contention and adapts to file system load
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Topics
• Introduction to Commodity Clusters
• A brief history of Cluster computing
• Dominance of Clusters
• Core systems elements of Clusters
• SMP Nodes
• Operating Systems
• DEMO 1 : Arete Cluster Environment
• Throughput Computing
• Networks
• Resource Management / Scheduling Systems
• Message-passing/Cooperative programming model
• Cluster programming/application runtime environment
• Performance measurement & profiling of applications
• Summary Materials for Test
67
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Measuring Performance on Clusters
• Ways of measuring performance– Wall clock time
– Benchmarks
– Processor efficiency factors
– Scalability
– MPI communications and synchronization overhead
– System operations
• Tools– PAPI
– Tau
– Ganglia
– Many others
68
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
MPI Performance Measurement : VAMPIR
69
src : http://mumps.enseeiht.fr/
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
MPI Performance : Tau
70
src : http://www.cs.uoregon.edu/research/tau
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
Topics
• Introduction to Commodity Clusters
• A brief history of Cluster computing
• Dominance of Clusters
• Core systems elements of Clusters
• SMP Nodes
• Operating Systems
• DEMO 1 : Arete Cluster Environment
• Throughput Computing
• Networks
• Resource Management / Scheduling Systems
• Message-passing/Cooperative programming model
• Cluster programming/application runtime environment
• Performance measurement & profiling of applications
• Summary Materials for Test
71
CSC 7600 Lecture 3 : Commodity Clusters,Spring 2011
72
Summary – Material for the Test
• What is a commodity cluster – slide 4
• Commodity clusters vs “Constellations” – slide 8
• Key parameters for cluster computing – slide 24
• Where is the parallelism – slide 25
• Parameters for cluster nodes – slide 33
• Node operating system – slide 38,39,40,41
• Programming clusters – slide 44
• Throughput computing – slide 45
• Decoupled work queue model – slide 46
• Interconnect options – slide 48
• Scheduling systems – slide 57, 58, 59
• Message passing : MPI software – slide 61
• Distributed file systems – slide 64
• Measuring performance on cluster: Metrics & Tools – slide 68
CSC 7600 Lecture 5 : Capacity Computing, Spring 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
CAPACITY COMPUTING
Prof. Thomas Sterling Department of Computer Science Louisiana State University February 1, 2011
CSC 7600 Lecture 5 : Capacity Computing, Spring 2011
Topics
• Key terms and concepts • Basic definitions • Models of parallelism • Speedup and Overhead • Capacity Computing & Unix utilities • Condor : Overview • Condor : Useful commands • Performance Issues in Capacity Computing • Material for Test
2
CSC 7600 Lecture 5 : Capacity Computing, Spring 2011
Topics
• Key terms and concepts • Basic definitions • Models of parallelism • Speedup and Overhead • Capacity Computing & Unix utilities • Condor : Overview • Condor : Useful commands • Performance Issues in Capacity Computing • Material for Test
3
CSC 7600 Lecture 5 : Capacity Computing, Spring 2011
Key Terms and Concepts
4
Problem
instructions
CPU
Conven1onal serial execu+on where the problem is represented as a series of instruc1ons that are executed by the CPU (also sequen+al execu+on)
CPU CPU CPU CPU
instructions
Task Task Task Task Problem
Parallel execution of a problem involves partitioning of the problem into multiple executable parts that are mutually exclusive and collectively exhaustive, represented as a partially ordered set exhibiting concurrency.
Parallel computing takes advantage of concurrency to:
• Solve larger problems within bounded time
• Save on wall clock time
• Overcome memory constraints
• Utilize non-local resources
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Key Terms and Concepts • Scalable Speedup : Relative reduction of execution time of a fixed
size workload through parallel execution
• Scalable Efficiency : Ratio of the actual performance to the best possible performance.
5
Speedup = execution_time_on_one_processor / execution_time_on_N_processors

Efficiency = execution_time_on_one_processor / (execution_time_on_N_processors × N)
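For example (illustrative numbers only): if a fixed workload runs in 100 seconds on one processor and in 20 seconds on 8 processors, the speedup is 100/20 = 5 and the efficiency is 100/(20 × 8) ≈ 0.63.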
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Topics
• Key terms and concepts • Basic definitions • Models of parallelism • Speedup and Overhead • Capacity Computing & Unix utilities • Condor : Overview • Condor : Useful commands • Performance Issues in Capacity Computing • Material for Test
6
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Defining the 3 C’s … • Main Classes of computing :
– High capacity parallel computing : A strategy for employing distributed computing resources to achieve high throughput processing among decoupled tasks. Aggregate performance of the total system is high if sufficient tasks are available to be carried out concurrently on all separate processing elements. No single task is accelerated. Uses increased workload size of multiple tasks with increased system scale.
– High capability parallel computing : A strategy for employing tightly coupled structures of computing resources to achieve reduced execution time of a given application through partitioning into concurrently executable tasks. Uses fixed workload size with increased system scale.
– Cooperative computing : A strategy for employing moderately coupled ensemble of computing resources to increase size of the data set of a user application while limiting its execution time. Uses a workload of a single task of increased data set size with increased system scale.
7
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Strong Scaling Vs. Weak Scaling
8
[Figure: work per task vs. machine scale (# of nodes: 1, 2, 4, 8). Under weak scaling the work per task stays constant as nodes are added; under strong scaling the work per task shrinks.]
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Strong Scaling, Weak Scaling
9
[Figure: two plots against machine scale (# of nodes). Strong scaling holds the total problem size fixed, so granularity (size / node) falls as the machine grows; weak scaling grows the total problem size with the machine, keeping granularity per node constant.]
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Defining the 3 C’s … • High capacity computing systems emphasize the
overall work performed over a fixed time period. Work is defined as the aggregate amount of computation performed across all functional units, all threads, all cores, all chips, all coprocessors and network interface cards in the system.
• High capability computing systems emphasize improvement (reduction) in execution time of a single user application program of fixed data set size.
• Cooperative computing systems emphasize single application weak scaling – Performance increase through increase in problem size
(usually data set size and # of task partitions) with increase in system scale
10
Adapted from : High-performance throughput computing S Chaudhry, P Caprioli, S Yip, M Tremblay - IEEE Micro, 2005 - doi.ieeecomputersociety.org
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Strong Scaling, Weak Scaling
11
Workload size scaling: Capability corresponds to strong scaling of a single job; Cooperative to weak scaling of a single job; Capacity to weak scaling across many independent jobs.
• Capability – Primary scaling is a decrease in response time proportional to the increase in resources applied. Single job of constant size; goal is response-time scaling proportional to machine size. Tightly-coupled concurrent tasks making up a single job.
• Cooperative – Single job, with different nodes working on different partitions of the same job. Job size scales proportionally to the machine; granularity per node is fixed over the range of system scale. Loosely coupled concurrent tasks making up a single job.
• Capacity – Primary scaling is an increase in throughput proportional to the increase in resources applied. Decoupled concurrent tasks, each a separate job, increasing in number of instances – scaling proportional to the machine.
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Topics
• Key terms and concepts • Basic definitions • Models of parallelism • Speedup and Overhead • Capacity Computing & Unix utilities • Condor : Overview • Condor : Useful commands • Performance Issues in Capacity Computing • Material for Test
12
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Models of Parallel Processing • Conventional models of parallel processing
– Decoupled Work Queue (covered in segment 1 of the course) – Communicating Sequential Processing (CSP message passing)
(covered in segment 2) – Shared memory multiple thread (covered in segment 3)
• Some alternative models of parallel processing – SIMD
• Single instruction stream multiple data stream processor array – Vector Machines
• Hardware execution of value sequences to exploit pipelining – Systolic
• An interconnection of basic arithmetic units to match algorithm – Data Flow
• Data precedent constraint self-synchronizing fine grain execution units supporting functional (single assignment) execution
13
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Shared memory multiple Thread
• Static or dynamic • Fine Grained • OpenMP • Distributed shared memory systems • Covered in Segment 3
14
[Figure: two shared-memory organizations, each showing CPU 1, CPU 2 and CPU 3 with memories connected over a network (image credit: Orion, JPL/NASA).]
Symmetric Multi-Processor (SMP, usually cache coherent)
Distributed Shared Memory (DSM, usually cache coherent)
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Communicating Sequential Processes
• One process is assigned to each processor
• Work done by the processor is performed on the local data
• Data values are exchanged by messages
• Synchronization constructs for inter process coordination
• Distributed Memory • Coarse Grained • MPI application programming interface • Commodity clusters and MPP
– MPP is acronym for “Massively Parallel Processor”
• Covered in Segment 2
15
[Figure: distributed-memory organization with CPU 1, CPU 2 and CPU 3, each with its own local memory, connected by a network; example system: QueenBee.]
Distributed Memory (DM, often not cache coherent)
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Decoupled Work Queue Model
• Concurrent disjoint tasks – Job stream parallelism – Parametric Studies
• SPMD (single program multiple data)
• Very coarse grained • Example software package : Condor • Processor farms and commodity clusters • This lecture covers this model of parallelism
16
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Topics
• Key terms and concepts • Basic definitions • Models of parallelism • Speedup and Overhead • Capacity Computing & Unix utilities • Condor : Overview • Condor : Useful commands • Performance Issues in Capacity Computing • Material for Test
17
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Ideal Speedup Issues
18
• W is total workload measured in elemental pieces of work (e.g. operations, instructions, subtasks, tasks, etc.)
• T(p) is total execution time measured in elemental time steps (e.g. clock cycles) where p is # of execution sites (e.g. processors, threads)
• wi is the work for a given task i, measured in operations
• Example: here we divide a mega-operation (2^20-operation) workload, W, into 1024 tasks, w1 to w1024, each of 1 K (2^10) operations
• Assume 256 processors perform the workload in parallel: T(256) = 4096 steps, speedup = 256, Eff = 1
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Ideal Speedup Example
19
W = 2^20 operations in total, divided into tasks w1 … w1024 of 2^10 operations each, executed on P = 2^8 processors (units: steps).

W = Σi wi

T(1) = 2^20          T(2^8) = 2^12

Speedup = T(1) / T(2^8) = 2^20 / 2^12 = 2^8 = 256

Efficiency = T(1) / ( T(2^8) × 2^8 ) = 2^20 / ( 2^12 × 2^8 ) = 2^0 = 1
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Granularities in Parallelism Overhead
• The additional work that needs to be performed in order to manage the parallel resources and concurrent abstract tasks that is in the critical time path.
Coarse Grained • Decompose problem into large independent
tasks. Usually there is no communication between the tasks. Also defined as a class of parallelism where: “relatively large amounts of computational work is done between communication”
Fine Grained • Decompose problem into smaller inter-
dependent tasks. Usually these tasks are communication intensive. Also defined as a class of parallelism where: “relatively small amounts of computational work are done between communication events” –www.llnl.gov/computing/tutorials/parallel_comp
20
Images adapted from : http://www.mhpcc.edu/training/workshop/parallel_intro/
[Figure: overhead vs. computation per task. A coarse-grained task amortizes its overhead over a large block of computation; a finely grained task spends a proportionally larger share of its time in overhead.]
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Overhead
21
• Overhead: Additional critical path (in time) work required to manage parallel resources and concurrent tasks that would not be necessary for purely sequential execution
• V is total overhead of workload execution • vi is overhead for individual task wi
• Each task takes vi +wi time steps to complete • Overhead imposes upper bound on scalability
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Overhead
22
[Figure: each task consists of overhead v followed by work w; for four tasks, total time = V + W = 4v + 4w.]

wi = W / P

TP = v + W / P

S = T1 / TP = (W + v) / (W/P + v) ≈ W / (W/P + v) = P / (1 + P·v / W)

v = overhead, V = total overhead, w = work unit, W = total work, Ti = execution time with i processors, P = # processors

W = Σ(i=1..P) wi

Assumption: workload is infinitely divisible
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Scalability and Overhead for fixed sized work tasks
23
• W is divided into J tasks of size wg • Each task requires v overhead work to manage • For P processors there are approximately J/P tasks to be
performed in sequence, so • TP is J(wg + v)/P • Note that S = T1 / TP
• So, S = P / (1 + v / wg)
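As an illustration with made-up numbers: for per-task overhead v = 1 step, task size wg = 100 steps, and P = 64 processors, S = 64 / (1 + 1/100) ≈ 63.4, so even a 1% per-task overhead keeps the speedup slightly below the processor count.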
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Scalability & Overhead
24
J = # tasks = ⌈W / wg⌉ ≈ W / wg

T1 = W + v ≈ W

TP = (J / P) × (wg + v) = (W / (P × wg)) × (wg + v) = (W / P) × (1 + v / wg)

S = T1 / TP ≈ W / [ (W / P) × (1 + v / wg) ] = P / (1 + v / wg)    when W >> v

v = overhead, wg = work unit, W = total work, Ti = execution time with i processors, P = # processors, J = # tasks
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Topics
• Key terms and concepts • Basic definitions • Models of parallelism • Speedup and Overhead • Capacity Computing & Unix utilities • Condor : Overview • Condor : Useful commands • Performance Issues in Capacity Computing • Material for Test
25
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Capacity Computing with basic Unix tools
• Combination of common Unix utilities such as ssh, scp, rsh, rcp can be used to remotely create jobs (to get more information about these commands try man ssh, man scp, man rsh, man rcp on any Unix shell)
• For small workloads it can be convenient to translate the execution of the program into a simple shell script.
• Relying on simple Unix utilities poses several application management constraints for cases such as : – Aborting started jobs – Querying for free machines – Querying for job status – Retrieving job results – etc..
26
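As a sketch only (the hostnames, directories, and program name below are hypothetical), such a shell script might farm out independent runs over ssh:

#!/bin/bash
# Launch one independent run of ./myprog on each remote host via ssh;
# hosts, working directory, and input/output names are illustrative.
i=1
for host in node01 node02 node03 node04; do
    ssh "$host" "cd ~/work && ./myprog input.$i > output.$i" &
    i=$((i+1))
done
wait    # block until every background ssh (and thus every run) has finished
echo "all runs complete"

Even a script like this provides none of the management features listed above (aborting jobs, querying status, retrieving results), which is what motivates a workload manager such as Condor.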
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
BOINC, SETI@Home • BOINC (Berkeley Open Infrastructure for Network Computing) • Open-source software that enables distributed coarse-grained
computations over the Internet. • Follows the Master-Worker model; in BOINC no
communication takes place among the worker nodes • SETI@Home • Einstein@Home • Climate prediction • And many more…
27
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Topics
• Key terms and concepts • Basic definitions • Models of parallelism • Speedup and Overhead • Capacity Computing & Unix utilities • Condor : Overview • Condor : Useful commands • Performance Issues in Capacity Computing • Material for Test
28
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Management Middleware : Condor
• Designed, developed and maintained at the University of Wisconsin–Madison by a team led by Miron Livny
• Condor is a versatile workload management system for managing a pool of distributed computing resources to provide high capacity computing.
• Assists distributed job management by providing mechanisms for job queuing, scheduling, and priority management, and tools that facilitate utilization of resources across Condor pools
• Condor also enables resource management by providing monitoring utilities, authentication & authorization mechanisms, condor pool management utilities and support for Grid Computing middleware such as Globus.
• Condor Components • ClassAds • Matchmaker • Problem Solvers
29
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Condor Components : Class Ads • ClassAds (Classified Advertisements) concept is
very similar to the newspaper classifieds concepts where buyers and sellers advertise their products using abstract yet uniquely defining named expressions. Example : Used Car Sales
• ClassAds language in Condor provides well defined means of describing the User Job and the end resources ( storage / computational ) so that the Condor MatchMaker can match the job with the appropriate pool of resources.
Management Middleware : Condor
Src : Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience" Concurrency and Computation: Practice and
Experience, Vol. 17, No. 2-4, pages 323-356, February-April, 2005. http://www.cs.wisc.edu/condor/doc/condor-practice.pdf
30
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Job ClassAd & Machine ClassAd
31
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Condor MatchMaker • MatchMaker, a crucial part of the Condor
architecture, uses the job description classAd provided by the user and matches the Job to the best resource based on the Machine description classAd
• MatchMaking in Condor is performed in 4 steps : 1. Job Agent (A) and resources (R) advertise themselves. 2. Matchmaker (M) processes the known classAds and
generates pairs that best match resources and jobs 3. Matchmaker informs each party of the job-resource pair of
their prospective match. 4. The Job agent and resource establish connection for further
processing. (Matchmaker plays no role in this step, thus ensuring separation between selection of resources and subsequent activities)
Management Middleware : Condor
Src : Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience" Concurrency and
Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April, 2005.
http://www.cs.wisc.edu/condor/doc/condor-practice.pdf
32
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Topics
• Key terms and concepts • Basic definitions • Models of parallelism • Speedup and Overhead • Capacity Computing & Unix utilities • Condor : Overview • Condor : Useful commands • Performance Issues in Capacity Computing • Material for Test
33
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Condor Problem Solvers • Master-Worker (MW) is a problem solving system that is
useful for solving a coarse grained problem of indeterminate size such as parameter sweep etc.
• The MW Solver in Condor consists of 3 main components : work-list, a tracking module, and a steering module. The work-list keeps track of all pending work that master needs done. The tracking module monitors progress of work currently in progress on the worker nodes. The steering module directs computation based on results gathered and the pending work-list and communicates with the matchmaker to obtain additional worker processes.
• DAGMan is used to execute multiple jobs that have dependencies represented as a Directed Acyclic Graph where the nodes correspond to the jobs and edges correspond to the dependencies between the jobs. DAGMan provides various functionalities for job monitoring and fault tolerance via creation of rescue DAGs.
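As a sketch of how such dependencies are written down (the job names and submit-file names below are invented for illustration, not taken from the course examples), a small diamond-shaped DAG could be described as:

# diamond.dag -- hypothetical DAGMan input file
JOB A a.submit
JOB B b.submit
JOB C c.submit
JOB D d.submit
PARENT A CHILD B C
PARENT B C CHILD D

and handed to DAGMan with condor_submit_dag diamond.dag; jobs B and C run only after A completes, and D runs only after both B and C complete.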
Management Middleware : Condor
34
[Figure: Master-Worker structure, with a master process distributing work to workers w1 … wN.]
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Management Middleware : Condor
In-depth Coverage : http://www.cs.wisc.edu/condor/publications.html
Recommended Reading : Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience",
Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April, 2005. [PDF]
Todd Tannenbaum, Derek Wright, Karen Miller, and Miron Livny, "Condor - A Distributed Job Scheduler", in Thomas Sterling, editor, Beowulf Cluster Computing with Linux, The MIT Press, 2002.
ISBN: 0-262-69274-0 [Postscript] [PDF]
35
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Core components of Condor • condor_master: This program runs constantly and ensures that all other parts of Condor
are running. If they hang or crash, it restarts them. • condor_collector: This program is part of the Condor central manager. It collects
information about all computers in the pool as well as which users want to run jobs. It is what normally responds to the condor_status command. It's not running on your computer, but on the main Condor pool host (Arete head node).
• condor_negotiator: This program is part of the Condor central manager. It decides what jobs should be run where. It's not running on your computer, but on the main Condor pool host (Arete head node).
• condor_startd: If this program is running, it allows jobs to be started up on this computer--that is, Arete is an "execute machine". This advertises Arete to the central manager (more on that later) so that it knows about this computer. It will start up the jobs that run.
• condor_schedd If this program is running, it allows jobs to be submitted from this computer--that is, desktron is a "submit machine". This will advertise jobs to the central manager so that it knows about them. It will contact a condor_startd on other execute machines for each job that needs to be started.
• condor_shadow For each job that has been submitted from this computer (e.g., desktron), there is one condor_shadow running. It will watch over the job as it runs remotely. In some cases it will provide some assistance. You may or may not see any condor_shadow processes running, depending on what is happening on the computer when you try it out.
36
Source : http://www.cs.wisc.edu/condor/tutorials/cw2005-condor/intro.html
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Condor : A Walkthrough of Condor commands
condor_status : provides current pool status condor_q : provides current job queue condor_submit : submit a job to condor pool condor_rm : delete a job from job queue
37
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
What machines are available ? (condor_status)
condor_status queries resource information sources and provides the current status of the condor pool of resources
38
§ Some common condor_status command line options :
§ -help : displays usage information
§ -avail : queries condor_startd ads and prints information about available resources
§ -claimed : queries condor_startd ads and prints information about claimed resources
§ -ckptsrvr : queries condor_ckpt_server ads and displays checkpoint server attributes
§ -pool hostname : queries the specified central manager (by default queries $COLLECTOR_HOST)
§ -verbose : displays entire classads
§ For more options and what they do run "condor_status -help"
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
condor_status : Resource States
• Owner : The machine is currently being utilized by a user. The machine is currently unavailable for jobs submitted by condor until the current user job completes.
• Claimed : Condor has selected the machine for use by other users.
• Unclaimed : Machine is unused and is available for selection by condor.
• Matched : Machine is in a transition state between unclaimed and claimed
• Preempting : Machine is currently vacating the resource to make it available to condor.
39
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Example : condor_status
40
[cdekate@celeritas ~]$ condor_status
Name          OpSys   Arch    State      Activity  LoadAv  Mem   ActvtyTime
vm1@compute-0 LINUX   X86_64  Unclaimed  Idle      0.000   1964  3+13:42:23
vm2@compute-0 LINUX   X86_64  Unclaimed  Idle      0.000   1964  3+13:42:24
vm3@compute-0 LINUX   X86_64  Unclaimed  Idle      0.010   1964  0+00:45:06
vm4@compute-0 LINUX   X86_64  Owner      Idle      1.000   1964  0+00:00:07
vm1@compute-0 LINUX   X86_64  Unclaimed  Idle      0.000   1964  3+13:42:25
vm2@compute-0 LINUX   X86_64  Unclaimed  Idle      0.000   1964  1+09:05:58
vm3@compute-0 LINUX   X86_64  Unclaimed  Idle      0.000   1964  3+13:37:27
vm4@compute-0 LINUX   X86_64  Unclaimed  Idle      0.000   1964  0+00:05:07
… …
vm3@compute-0 LINUX   X86_64  Unclaimed  Idle      0.000   1964  3+13:42:33
vm4@compute-0 LINUX   X86_64  Unclaimed  Idle      0.000   1964  3+13:42:34
              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
X86_64/LINUX     32      3        0         29        0           0         0
       Total     32      3        0         29        0           0         0
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
What jobs are currently in the queue? condor_q
• condor_q provides a list of job that have been submitted to the Condor pool
• Provides details about jobs including which cluster the job is running on, owner of the job, memory consumption, the name of the executable being processed, current state of the job, when the job was submitted and how long has the job been running.
41
§ Some common condor_q command line options :
§ -global : queries all job queues in the pool
§ -name : queries based on the schedd name; provides a queue listing of the named schedd
§ -claimed : queries condor_startd ads and prints information about claimed resources
§ -goodput : displays job goodput statistics ("goodput is the allocation time when an application uses a remote workstation to make forward progress." – Condor Manual)
§ -cputime : displays the remote CPU time accumulated by the job to date...
§ For more options run : "condor_q -help"
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
[cdekate@celeritas ~]$ condor_q
-- Submitter: celeritas.cct.lsu.edu : <130.39.128.68:40472> : celeritas.cct.lsu.edu
 ID      OWNER     SUBMITTED     RUN_TIME  ST  PRI  SIZE  CMD
30.0     cdekate   1/23 07:52   0+00:01:13  R   0   9.8   fib 100
30.1     cdekate   1/23 07:52   0+00:01:09  R   0   9.8   fib 100
30.2     cdekate   1/23 07:52   0+00:01:07  R   0   9.8   fib 100
30.3     cdekate   1/23 07:52   0+00:01:11  R   0   9.8   fib 100
30.4     cdekate   1/23 07:52   0+00:01:05  R   0   9.8   fib 100
5 jobs; 0 idle, 5 running, 0 held
[cdekate@celeritas ~]$
42
Example : condor_q
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
How to submit your Job ? condor_submit
• Create a job classAd (condor submit file) that contains Condor keywords and user configured values for the keywords.
• Submit the job classAd using “condor_submit” • Example :
condor_submit matrix.submit • condor_submit –h provides additional flags
43
[cdekate@celeritas NPB3.2-MPI]$ condor_submit -h
Usage: condor_submit [options] [cmdfile]
Valid options:
  -verbose              verbose output
  -name <name>          submit to the specified schedd
  -remote <name>        submit to the specified remote schedd (implies -spool)
  -append <line>        add line to submit file before processing (overrides submit file; multiple -a lines ok)
  -disable              disable file permission checks
  -spool                spool all files to the schedd
  -password <password>  specify password to MyProxy server
  -pool <host>          Use host as the central manager to query
If [cmdfile] is omitted, input is read from stdin
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
condor_submit : Example
44
[cdekate@celeritas ~]$ condor_submit fib.submit
Submitting job(s).....
Logging submit event(s).....
5 job(s) submitted to cluster 35.
[cdekate@celeritas ~]$ condor_q
-- Submitter: celeritas.cct.lsu.edu : <130.39.128.68:51675> : celeritas.cct.lsu.edu
 ID      OWNER     SUBMITTED     RUN_TIME  ST  PRI  SIZE  CMD
35.0     cdekate   1/24 15:06   0+00:00:00  I   0   9.8   fib 10
35.1     cdekate   1/24 15:06   0+00:00:00  I   0   9.8   fib 15
35.2     cdekate   1/24 15:06   0+00:00:00  I   0   9.8   fib 20
35.3     cdekate   1/24 15:06   0+00:00:00  I   0   9.8   fib 25
35.4     cdekate   1/24 15:06   0+00:00:00  I   0   9.8   fib 30
5 jobs; 5 idle, 0 running, 0 held
[cdekate@celeritas ~]$
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
How to delete a submitted job ? condor_rm
• condor_rm : Deletes one or more jobs from Condor job pool. If a particular Condor pool is specified as one of the arguments then the condor_schedd matching the specification is contacted for job deletion, else the local condor_schedd is contacted.
45
[cdekate@celeritas ~]$ condor_rm -h
Usage: condor_rm [options] [constraints]
 where [options] is zero or more of:
  -help               Display this message and exit
  -version            Display version information and exit
  -name schedd_name   Connect to the given schedd
  -pool hostname      Use the given central manager to find daemons
  -addr <ip:port>     Connect directly to the given "sinful string"
  -reason reason      Use the given RemoveReason
  -forcex             Force the immediate local removal of jobs in the X state
                      (only affects jobs already being removed)
 and where [constraints] is one or more of:
  cluster.proc        Remove the given job
  cluster             Remove the given cluster of jobs
  user                Remove all jobs owned by user
  -constraint expr    Remove all jobs matching the boolean expression
  -all                Remove all jobs (cannot be used with other constraints)
[cdekate@celeritas ~]$
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
[cdekate@celeritas ~]$ condor_q
-- Submitter: celeritas.cct.lsu.edu : <130.39.128.68:51675> : celeritas.cct.lsu.edu
 ID      OWNER     SUBMITTED     RUN_TIME  ST  PRI  SIZE  CMD
41.0     cdekate   1/24 15:43   0+00:00:03  R   0   9.8   fib 100
41.1     cdekate   1/24 15:43   0+00:00:01  R   0   9.8   fib 150
41.2     cdekate   1/24 15:43   0+00:00:00  R   0   9.8   fib 200
41.3     cdekate   1/24 15:43   0+00:00:00  R   0   9.8   fib 250
41.4     cdekate   1/24 15:43   0+00:00:00  R   0   9.8   fib 300
5 jobs; 0 idle, 5 running, 0 held
[cdekate@celeritas ~]$ condor_rm 41.4
Job 41.4 marked for removal
[cdekate@celeritas ~]$ condor_rm 41
Cluster 41 has been marked for removal.
[cdekate@celeritas ~]$
46
condor_rm : Example
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Creating Condor submit file ( Job a ClassAd )
• Condor submit file contains key-value pairs that help describe the application to condor.
• Condor submit files are job ClassAds. • Some of the common descriptions found in the job
ClassAds are :
47
executable = (path to the executable to run on Condor)
input = (standard input provided as a file)
output = (standard output stored in a file)
log = (output to log file)
arguments = (arguments to be supplied to the queue)
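For instance, a minimal submit file for the fib program used in the examples above might look like the following sketch (the argument value and output/log file names are illustrative, not taken from the course material):

# fib.submit -- hypothetical job ClassAd
universe   = vanilla
executable = fib
arguments  = 100
output     = fib.$(Process).out
error      = fib.$(Process).err
log        = fib.log
queue 5

Running condor_submit fib.submit on such a file would queue five instances of the job, much like the condor_q listings shown earlier.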
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
DEMO : Steps involved in running a job on Condor.
1. Creating a Condor submit file 2. Submitting the Condor submit file to a Condor pool 3. Checking the current state of a submitted job 4. Job status Notification
48
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Condor Usage Statistics
49
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Montage workload implemented and executed using Condor ( Source : Dr. Dan Katz )
• Mosaicking astronomical images :
• Powerful telescopes taking high-resolution (and highest-zoom) pictures of the sky can cover only a small region over time
• The problem being solved in this project is "stitching" these images together to make a high-resolution, zoomed-in snapshot of the sky
• Aggregate requirements of 140,000 CPU hours (~16 years on a single machine), with output on the order of 6 TeraBytes
50
[Figure: example DAG for 10 input files, built from Montage compute nodes (mProject, mDiff, mFitPlane, mConcatFit, mBgModel, mBackground, mAdd) together with data stage-in, data stage-out and registration nodes. Pegasus maps the abstract workflow to an executable form, using Grid information systems (information about available resources and data location) and MyProxy (the user's grid credentials); Condor DAGMan then executes the workflow on the Grid. http://pegasus.isi.edu/]
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Montage Use By IPHAS: The INT/WFC Photometric H-alpha Survey of the Northern Galactic Plane
(Source : Dr. Dan Katz)
Supernova remnant S147
Nebulosity in vicinity of HII region, IC 1396B, in Cepheus
Crescent Nebula NGC 6888
Study extreme phases of stellar evolution that involve very large mass loss
51
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Topics
• Key terms and concepts • Basic definitions • Models of parallelism • Speedup and Overhead • Capacity Computing & Unix utilities • Condor : Overview • Condor : Useful commands • Performance Issues in Capacity Computing • Material for Test
52
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
53
• Throughput computing • Performance measured as total workload performed over time
to complete • Overhead factors
– Start up time – Input data distribution – Output result data collection – Terminate time – Inter-task coordination overhead (No task coupling)
• Starvation – Insufficient work to keep all processors busy – Inadequate parallelism of coarse grained task parallelism – Poor or uneven load distribution
Capacity Computing Performance Issues
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Topics
• Key terms and concepts • Basic definitions • Models of parallelism • Speedup and Overhead • Capacity Computing & Unix utilities • Condor : Overview • Condor : Useful commands • Performance Issues in Capacity Computing • Material for Test
54
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
Summary : Material for the Test • Key terms & Concepts (4,5,7,8,9,10,11) • Decoupled work-queue model (16) • Ideal speedup (18,19) • Overhead and Scalability (20,21,22,23,24) • Understand Condor concepts detailed in slides (30,
31,32, 34,35, 36,37) • Capacity computing performance issues (53) • Required reading materials :
– http://www.cct.lsu.edu/~cdekate/7600/beowulf-chapter-rev1.pdf – Specific pages to focus on : 3-16
55
CSC 7600 Lecture 5 : Capacity Compu1ng, Spring 2011
56
CSC 7600 Lecture 7 : MPI1 Spring 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &
MEANS
MESSAGE PASSING INTERFACE MPI
(PART A)
Prof. Thomas Sterling, Department of Computer Science, Louisiana State University, February 8, 2011
CSC 7600 Lecture 7 : MPI1 Spring 2011
2
Topics
• Introduction
• MPI Standard
• MPI-1 Model and Basic Calls
• MPI Communicators
• Point to Point Communication
• Point to Point Communication in-depth
• Deadlock
• Trapezoidal Rule : A Case Study
• Using MPI & Jumpshot to profile parallel applications
• Summary – Materials for the Test
CSC 7600 Lecture 7 : MPI1 Spring 2011
3
Topics
• Introduction
• MPI Standard
• MPI-1 Model and Basic Calls
• MPI Communicators
• Point to Point Communication
• Point to Point Communication in-depth
• Deadlock
• Trapezoidal Rule : A Case Study
• Using MPI & Jumpshot to profile parallel applications
• Summary – Materials for the Test
CSC 7600 Lecture 7 : MPI1 Spring 2011
Opening Remarks
• Context: distributed memory parallel computers
• We have communicating sequential processes, each
with their own memory, and no access to another
process's memory
– A fairly common scenario from the mid 1980s (Intel Hypercube)
to today
– Processes interact (exchange data, synchronize) through
message passing
– Initially, each computer vendor had its own library and calls
– First standardization was PVM
• Started in 1989, first public release in 1991
• Worked well on distributed machines
• Next was MPI
4
CSC 7600 Lecture 7 : MPI1 Spring 2011
What you'll Need to Know
• What is a standard API
• How to build and run an MPI-1 program
• Basic MPI functions
– 4 basic environment functions
• Including the idea of communicators
– Basic point-to-point functions
• Blocking and non-blocking
• Deadlock and how to avoid it
• Data types
– Basic collective functions
• The advanced MPI-1 material may be required for the
problem set
• The MPI-2 highlights are just for information
5
CSC 7600 Lecture 7 : MPI1 Spring 2011
6
Topics
• Introduction
• MPI Standard
• MPI-1 Model and Basic Calls
• MPI Communicators
• Point to Point Communication
• Point to Point Communication in-depth
• Deadlock
• Trapezoidal Rule : A Case Study
• Using MPI & Jumpshot to profile parallel applications
• Summary – Materials for the Test
CSC 7600 Lecture 7 : MPI1 Spring 2011
MPI Standard
• From 1992-1994, a community representing both
vendors and users decided to create a standard
interface to message passing calls in the context of
distributed memory parallel computers (MPPs, there
weren't really clusters yet)
• MPI-1 was the result
– “Just” an API
– FORTRAN77 and C bindings
– Reference implementation (mpich) also developed
– Vendors also kept their own internals (behind the API)
7
CSC 7600 Lecture 7 : MPI1 Spring 2011
MPI Standard
• Since then– MPI-1.1
• Fixed bugs, clarified issues
– MPI-2
• Included MPI-1.2
– Fixed more bugs, clarified more issues
• Extended MPI
– New datatype constructors, language interoperability
• New functionality
– One-sided communication
– MPI I/O
– Dynamic processes
• FORTRAN90 and C++ bindings
• Best MPI reference– MPI Standard - on-line at: http://www.mpi-forum.org/
8
CSC 7600 Lecture 7 : MPI1 Spring 2011
9
Topics
• Introduction
• MPI Standard
• MPI-1 Model and Basic Calls
• MPI Communicators
• Point to Point Communication
• Point to Point Communication in-depth
• Deadlock
• Trapezoidal Rule : A Case Study
• Using MPI & Jumpshot to profile parallel applications
• Summary – Materials for the Test
CSC 7600 Lecture 7 : MPI1 Spring 2011
MPI : Basics
• Every MPI program must contain the preprocessor directive
• The mpi.h file contains the definitions and declarations necessary for
compiling an MPI program.
• mpi.h is usually found in the “include” directory of most MPI
installations. For example on arete:
10
#include "mpi.h"
...
#include "mpi.h"
...
MPI_Init(&argc, &argv);
...
...
MPI_Finalize();
...
CSC 7600 Lecture 7 : MPI1 Spring 2011
11
MPI: Initializing MPI Environment
Function: MPI_Init()
int MPI_Init(int *argc, char ***argv)
Description: Initializes the MPI execution environment. MPI_Init() must be called before any other MPI functions can be called and it should be called only once. It allows systems to do any special setup so that the MPI library can be used. argc is a pointer to the number of arguments and argv is a pointer to the argument vector. On exit from this routine, all processes will have a copy of the argument list.
...
#include “mpi.h”
...
MPI_Init(&argc,&argv);...
...
MPI_Finalize();
...
http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Init.html
CSC 7600 Lecture 7 : MPI1 Spring 2011
12
MPI: Terminating MPI Environment
Function: MPI_Finalize()
int MPI_Finalize()
Description: Terminates the MPI execution environment. All MPI processes must call this routine before exiting. MPI_Finalize() need not be the last executable statement or even in main; it must be called at some point following the last call to any other MPI function.
...
#include ”mpi.h”
...
MPI_Init(&argc,&argv);
...
...
MPI_Finalize();...
http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Finalize.html
CSC 7600 Lecture 7 : MPI1 Spring 2011
MPI Hello World
• C source file for a simple MPI Hello World
13
#include "mpi.h"#include <stdio.h>
int main( int argc, char *argv[]){
MPI_Init( &argc, &argv);printf("Hello, World!\n");MPI_Finalize();return 0;
}
Include header files
Initialize MPI Context
Finalize MPI Context
CSC 7600 Lecture 7 : MPI1 Spring 2011
Building an MPI Executable
• Library version
– User knows where header file and library are, and tells compiler
gcc -Iheaderdir -Llibdir mpicode.c -lmpich
• Wrapper version
– Does the same thing, but hides the details from the user
mpicc -o executable mpicode.c
You can do either one, but don't try to do both!
– use "sh -x mpicc -o executable mpicode.c" to figure out the gcc line
For our “Hello World” example on arete use:
mpicc -o hello hello.c
14
gcc -m64 -O2 -fPIC -Wl,-z,noexecstack -o hello hello.c -I/usr/include/mpich2-x86_64 -L/usr/lib64/mpich2/lib -L/usr/lib64/mpich2/lib -Wl,-rpath,/usr/lib64/mpich2/lib -lmpich -lopa -lpthread -lrt
OR
CSC 7600 Lecture 7 : MPI1 Spring 2011
Running an MPI Executable
• Some number of processes are started somewhere
– Again, the standard doesn't talk about this
– Implementation and interface varies
– Usually, some sort of mpiexec command starts some number of copies
of an executable according to a mapping
– Example:
'mpiexec -n 2 ./a.out' runs two copies of ./a.out, with the number of
processes specified as 2
– Most production supercomputing resources wrap the mpiexec command with
higher level scripts that interact with scheduling systems such as PBS /
LoadLeveler for efficient resource management and multi-user support
– Sample PBS / LoadLeveler job submission scripts :

PBS File:
#!/bin/bash
#PBS -l walltime=120:00:00,nodes=8:ppn=4
cd /home/cdekate/S1_L2_Demos/adc/
pwd
date
PROCS=`wc -l < $PBS_NODEFILE`
mpdboot --file=$PBS_NODEFILE
mpiexec -n $PROCS ./padcirc
mpdallexit
date

LoadLeveler File:
#!/bin/bash
#@ job_type = parallel
#@ job_name = SIMID
#@ wall_clock_limit = 120:00:00
#@ node = 8
#@ total_tasks = 32
#@ initialdir = /scratch/cdekate/
#@ executable = /usr/bin/poe
#@ arguments = /scratch/cdekate/padcirc
#@ queue
15
CSC 7600 Lecture 7 : MPI1 Spring 2011
Running the Hello World example
• Using mpiexec : • Using PBS
16
mpd &mpiexec -n 8 ./helloHello, World!Hello, World!Hello, World!Hello, World!Hello, World!Hello, World!Hello, World!Hello, World!
hello.pbs : #!/bin/bash#PBS -N hello#PBS -l walltime=00:01:00,nodes=2:ppn=4cd /home/cdekate/2008/l7wddatePROCS=`wc -l < $PBS_NODEFILE`mpdboot -f $PBS_NODEFILEmpiexec -n $PROCS ./hellompdallexitdate
more hello.o10030 /home/cdekate/2008/l7Wed Feb 6 10:58:36 CST 2008Hello, World!Hello, World!Hello, World!Hello, World!Hello, World!Hello, World!Hello, World!Hello, World!Wed Feb 6 10:58:37 CST 2008
CSC 7600 Lecture 7 : MPI1 Spring 2011
17
Topics
• Introduction
• MPI Standard
• MPI-1 Model and Basic Calls
• MPI Communicators
• Point to Point Communication
• Point to Point Communication in-depth
• Deadlock
• Trapezoidal Rule : A Case Study
• Using MPI & Jumpshot to profile parallel applications
• Summary – Materials for the Test
CSC 7600 Lecture 7 : MPI1 Spring 2011
MPI Communicators
• Communicator is an internal object
• MPI Programs are made up of communicating processes
• Each process has its own address space containing its own attributes such as rank, size (and argc, argv, etc.)
• MPI provides functions to interact with it
• Default communicator is MPI_COMM_WORLD
– All processes are its members
– It has a size (the number of processes)
– Each process has a rank within it
– One can think of it as an ordered list of processes
• Additional communicator(s) can co-exist
• A process can belong to more than one communicator
• Within a communicator, each process has a unique rank
[Figure: the default communicator MPI_COMM_WORLD, containing processes with ranks 0 through 7.]
18
CSC 7600 Lecture 7 : MPI1 Spring 2011
19
MPI: Size of Communicator
Function: MPI_Comm_size()
int MPI_Comm_size ( MPI_Comm comm, int *size )
Description:Determines the size of the group associated with a communicator (comm). Returns an integer number of processes in the group underlying comm executing the program. If comm is an inter-communicator (i.e. an object that has processes of two inter-communicating groups) , return the size of the local group (a size of a group where request is initiated from). The comm in the argument list refers to the communicator-group to be queried, the result of the query (size of the comm group) is stored in the variable size.
...
#include “mpi.h”
...
int size;MPI_Init(&Argc,&Argv);
...
MPI_Comm_size(MPI_COMM_WORLD, &size);MPI_Comm_rank(MPI_COMM_WORLD, &rank);
...
err = MPI_Finalize();
...
http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Comm_size.html
CSC 7600 Lecture 7 : MPI1 Spring 2011
20
MPI: Rank of a process in comm
Function: MPI_Comm_rank()
int MPI_Comm_rank ( MPI_Comm comm, int *rank )
Description:Returns the rank of the calling process in the group underlying the comm. If the comm is an inter-communicator, the call MPI_Comm_rank returns the rank of the process in the local group. The first parameter comm in the argument list is the communicator to be queried, and the second parameter rank is the integer number rank of the process in the group of comm.
http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Comm_rank.html
...
#include “mpi.h”
...
int rank;MPI_Init(&Argc,&Argv);
...
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);...
err = MPI_Finalize();
...
CSC 7600 Lecture 7 : MPI1 Spring 2011
Example : communicators
21
#include "mpi.h"#include <stdio.h>
int main( int argc, char *argv[]){
int rank, size;MPI_Init( &argc, &argv);MPI_Comm_rank( MPI_COMM_WORLD, &rank);MPI_Comm_size( MPI_COMM_WORLD, &size);printf("Hello, World! from %d of %d\n", rank, size );MPI_Finalize();return 0;
}
Determines the rank of the current process in the communicator-group
MPI_COMM_WORLD
Determines the size of the communicator-group MPI_COMM_WORLD
…
Hello, World! from 1 of 8
Hello, World! from 0 of 8
Hello, World! from 5 of 8
…
CSC 7600 Lecture 7 : MPI1 Spring 2011
Example : Communicator & Rank
• Compiling :
• Result :
22
mpicc -o hello2 hello2.c
Hello, World! from 4 of 8
Hello, World! from 3 of 8
Hello, World! from 1 of 8
Hello, World! from 0 of 8
Hello, World! from 5 of 8
Hello, World! from 6 of 8
Hello, World! from 7 of 8
Hello, World! from 2 of 8
CSC 7600 Lecture 7 : MPI1 Spring 2011
23
Topics
• Introduction
• MPI Standard
• MPI-1 Model and Basic Calls
• MPI Communicators
• Point to Point Communication
• Point to Point Communication in-depth
• Deadlock
• Trapezoidal Rule : A Case Study
• Using MPI & Jumpshot to profile parallel applications
• Summary – Materials for the Test
CSC 7600 Lecture 7 : MPI1 Spring 2011
MPI : Point to Point Communication
primitives
• A basic communication mechanism of MPI between a pair of processes in which one process is sending data and the other process receiving the data, is called “point to point communication”
• Message passing in MPI program is carried out by 2 main MPI functions– MPI_Send – sends message to a designated process
– MPI_Recv – receives a message from a process
• Each of the send and recv calls is appended with additional information along with the data that needs to be exchanged between application programs
• The message envelope consists of the following information– The rank of the receiver
– The rank of the sender
– A tag
– A communicator
• The source argument is used to distinguish messages received from different processes
• Tag is user-specified int that can be used to distinguish messages from a single process
24
CSC 7600 Lecture 7 : MPI1 Spring 2011
Message Envelope
• Communication across processes is performed using messages.
• Each message consists of a fixed number of fields that is used to distinguish them, called the Message Envelope :
– Envelope comprises source, destination, tag, communicator
– Message = Envelope + Data
• Communicator refers to the namespace associated with the group of related processes
25
[Figure: MPI_COMM_WORLD containing processes with ranks 0 through 7, with a message being sent from process 0 to process 1.]
Source : process 0
Destination : process 1
Tag : 1234
Communicator : MPI_COMM_WORLD
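As a sketch, the envelope above would result from a call on process 0 such as the following (the buffer, count, and datatype are chosen arbitrarily for illustration):

MPI_Send(buf, 10, MPI_INT, 1 /* dest */, 1234 /* tag */, MPI_COMM_WORLD);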
CSC 7600 Lecture 7 : MPI1 Spring 2011
26
MPI: (blocking) Send message
Function: MPI_Send()
int MPI_Send(
void *message,
int count,
MPI_Datatype datatype,
int dest,
int tag,
MPI_Comm comm )
Description:The contents of message are stored in a block of memory referenced by the first parameter
message. The next two parameters, count and datatype, allow the system to determine how much
storage is needed for the message: the message contains a sequence of count values, each having
MPI type datatype. MPI allows a message to be received as long as there is sufficient storage
allocated. If there isn't sufficient storage an overflow error occurs. The dest parameter corresponds
to the rank of the process to which message has to be sent.
http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Send.html
CSC 7600 Lecture 7 : MPI1 Spring 2011
MPI : Data Types
MPI datatype C datatype
MPI_CHAR signed char
MPI_SHORT signed short int
MPI_INT signed int
MPI_LONG signed long int
MPI_UNSIGNED_CHAR unsigned char
MPI_UNSIGNED_SHORT unsigned short int
MPI_UNSIGNED unsigned int
MPI_UNSIGNED_LONG unsigned long int
MPI_FLOAT float
MPI_DOUBLE double
MPI_LONG_DOUBLE long double
MPI_BYTE
MPI_PACKED
27
You can also define your own (derived datatypes), such as an array of ints of size 100, or more complex examples, such as a struct or an array of structs
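For example, a contiguous derived type covering an array of 100 ints can be built and used roughly as follows (a sketch; the variable names are illustrative):

MPI_Datatype hundred_ints;                         /* handle for the new derived datatype     */
MPI_Type_contiguous(100, MPI_INT, &hundred_ints);  /* 100 consecutive MPI_INT elements        */
MPI_Type_commit(&hundred_ints);                    /* commit before using it in communication */
/* e.g. MPI_Send(array, 1, hundred_ints, dest, tag, MPI_COMM_WORLD); */
MPI_Type_free(&hundred_ints);                      /* release the type when no longer needed  */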
CSC 7600 Lecture 7 : MPI1 Spring 2011
MPI: (blocking) Receive message
28
Function: MPI_Recv()
int MPI_Recv(
void *message,
int count,
MPI_Datatype datatype,
int source,
int tag,
MPI_Comm comm,
MPI_Status *status )
Description:The contents of message are stored in a block of memory referenced by the first parameter message. The
next two parameters, count and datatype, allow the system to determine how much storage is needed for
the message: the message contains a sequence of count values, each having MPI type datatype. MPI
allows a message to be received as long as there is sufficient storage allocated. If there isn't sufficient
storage, an overflow error occurs. The source parameter corresponds to the rank of the process from which
the message has been received. The MPI_Status parameter in the MPI_Recv() call returns information on
the data that was actually received. It references a record with 2 fields – one for the source and one for the
tag. http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Recv.html
CSC 7600 Lecture 7 : MPI1 Spring 2011
MPI_Status object
29
Object: MPI_Status
Example usage :MPI_Status status;
Description:The MPI_Status object is used by the receive functions to return data about the message, specifically the object contains the id of the process sending the message (MPI_SOURCE), the message tag (MPI_TAG), and error status (MPI_ERROR) .
#include "mpi.h"…
MPI_Status status; /* return status for */…MPI_Init(&argc, &argv);…if (my_rank != 0) {…
MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);}else { /* my rank == 0 */
for (source = 1; source < p; source++ ) {
MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);…MPI_Finalize();
…
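The received envelope can then be inspected through the status fields, for example (illustrative):

printf("message from rank %d with tag %d\n", status.MPI_SOURCE, status.MPI_TAG);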
CSC 7600 Lecture 7 : MPI1 Spring 2011
MPI : Example send/recv
30
/* hello world, MPI style */
#include "mpi.h"#include <stdio.h>#include <string.h>
int main(int argc, char* argv[]) {int my_rank; /* rank of process */int p; /* number of processes */int source; /* rank of sender */int dest; /* rank of receiver */
int tag=0; /* tag for messages */char message[100]; /* storage for message */MPI_Status status; /* return status for */
/* receive */
/* Start up MPI */MPI_Init(&argc, &argv);
/* Find out process rank */MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
/* Find out number of processes */MPI_Comm_size(MPI_COMM_WORLD, &p);
Src : Prof. Amy Apon
if (my_rank != 0) {/* Create message */sprintf(message, "Greetings from process %d!", my_rank);dest = 0;/* Use strlen+1 so that \0 gets transmitted */MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag,
MPI_COMM_WORLD);}else { /* my rank == 0 */
for (source = 1; source < p; source++ ) {MPI_Recv(message, 100, MPI_CHAR, source, tag,
MPI_COMM_WORLD, &status);printf("%s\n", message);
}printf("Greetings from process %d!\n", my_rank);
}
/* Shut down MPI */MPI_Finalize();
} /* end main */
CSC 7600 Lecture 7 : MPI1 Spring 2011
Communication map for the example.
31
mpiexec -n 8 ./hello3
Greetings from process 1!
Greetings from process 2!
Greetings from process 3!
Greetings from process 4!
Greetings from process 5!
Greetings from process 6!
Greetings from process 7!
Greetings from process 0!
Writing logfile....
Finished writing logfile.
[cdekate@celeritas l7]$
CSC 7600 Lecture 7 : MPI1 Spring 2011
32
Topics
• Introduction
• MPI Standard
• MPI-1 Model and Basic Calls
• MPI Communicators
• Point to Point Communication
• Point to Point Communication in-depth
• Deadlock
• Trapezoidal Rule : A Case Study
• Using MPI & Jumpshot to profile parallel applications
• Summary – Materials for the Test
CSC 7600 Lecture 7 : MPI1 Spring 2011
Point-to-point Communication
• How two processes interact
• Most flexible communication in MPI
• Two basic varieties
– Blocking and non-blocking
• Two basic functions
– Send and receive
• With these two functions, and the four functions
we already know, you can do everything in MPI
– But there's probably a better way to do a lot things,
using other functions
33
CSC 7600 Lecture 7 : MPI1 Spring 2011
Point to Point Communication :
Basic concepts (buffered)
Kernel modeUser mode
Process 0
Kernel modeUser mode
Process 1
Call send Subroutine
Return from send
Subroutine
Copy data from sendbuf to
sysbuf
Send data to the sysbuf at
the receiving end
Call receive Subroutine
Return from receive
Subroutine
Receive data from the sysbuf
at the sending end
Copy data from sysbuf to
recvbuf
sendbuf
sysbuf
sysbuf
recvbuf
Step 1
Step 2
Step 3
1. Data to be sent by the user is copied from the user memory space to the system buffer
2. The data is sent from the system buffer over the network to the system buffer of receiving process
3. The receiving process copies the data from system buffer to local user memory space
34
CSC 7600 Lecture 7 : MPI1 Spring 2011
MPI communication modes
• MPI offers several different types of communication modes, each having implications on data handling and performance:– Buffered
– Ready
– Standard
– Synchronous
• Each of these communication modes has both blocking and non-blocking primitives– In blocking point to point communication the send call blocks until the send block
can be reclaimed. Similarly the receive function blocks until the buffer has successfully obtained the contents of the message.
– In the non-blocking point to point communication the send and receive calls allow the possible overlap of communication with computation. Communication is usually done in 2 phases: the posting phase and the test for completion phase.
• Synchronization Overhead: the time spent waiting for an event to occur on another task.
• System Overhead: the time spent when copying the message data from the sender's message buffer to the network and from the network to the receiver's message buffer.
35
CSC 7600 Lecture 7 : MPI1 Spring 2011
Point to Point Communication
Blocking Synchronous Send
• The communication mode is selected while invoking the send routine.
• When blocking synchronous send is executed (MPI_Ssend()) , “ready to send” message is sent from the sending task to receiving task.
• When the receive call is executed (MPI_Recv()), “ready to receive” message is sent, followed by the transfer of data.
• The sender process must wait for the receive to be executed and for the handshake to arrive before the message can be transferred. (Synchronization Overhead)
• The receiver process also has to wait for the handshake process to complete. (Synchronization Overhead)
• Overhead incurred while copying from sender & receiver buffers to the network.
36
http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html
CSC 7600 Lecture 7 : MPI1 Spring 2011
Point to Point Communication
Blocking Ready Send
• The ready mode send call (MPI_Rsend) sends the message over the network
once the “ready to receive” message is received.
• If the "ready to receive" message hasn't arrived, the ready mode send will incur an
error and exit. The programmer is responsible for handling the error and
overriding the default behavior.
• The ready mode send call minimizes system overhead and synchronization
overhead incurred during sending of the task.
• The receive still incurs substantial synchronization overhead depending on how
much earlier the receive call is executed.
37
http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html
CSC 7600 Lecture 7 : MPI1 Spring 2011
Point to Point Communication
Blocking Buffered Send
• The blocking buffered send call (MPI_Bsend()) copies the data from the message buffer
to a user-supplied buffer and then returns.
• The message buffer can then be reclaimed by the sending process without having any
effect on any data that is sent.
• When the “ready to receive” notification is received the data from the user-supplied buffer
is sent to the receiver.
• Replicated copies of the buffer results in added system overhead.
• Synchronization overhead on the sender process is eliminated as the sending process
does not have to wait on the receive call.
• Synchronization overhead on the receiving process can still be incurred, because if the
receive is executed before the send, the process must wait before it can return to the
execution sequence
38
http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html
CSC 7600 Lecture 7 : MPI1 Spring 2011
Point to Point Communication
Blocking Standard Send
• The MPI_Send() operation is implementation dependent
• When the data size is smaller than a threshold value (varies for each implementation):
– The blocking standard send call (MPI_Send()) copies the message over the network
into the system buffer of the receiving node, after which the sending process continues
with the computation
– When the receive call (MPI_Recv()) is executed the message is copied from the
system buffer to the receiving task
– The decreased synchronization overhead is usually at the cost of increased system
overhead due to the extra copy of buffers
39
http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html
CSC 7600 Lecture 7 : MPI1 Spring 2011
Point to Point Communication
Buffered Standard Send
• When the message size is greater than a threshold: – The behavior is the same as for the synchronous mode
– Small messages benefit from the decreased chance of synchronization overhead
– Large messages results in increased cost of copying to the buffer and system overhead
40
http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html
CSC 7600 Lecture 7 : MPI1 Spring 2011
Point to Point Communication
Non-blocking Calls
• The non-blocking send call (MPI_Isend()) posts a non-blocking standard send when
the message buffer contents are ready to be transmitted
• The control returns immediately without waiting for the copy to the remote system
buffer to complete. MPI_Wait is called just before the sending task needs to overwrite
the message buffer
• Programmer is responsible for checking the status of the message to know whether
data to be sent has been copied out of the send buffer
• The receiving call (MPI_Irecv()) issues a non-blocking receive as soon as a message
buffer is ready to hold the message. The non-blocking receive returns without waiting
for the message to arrive. The receiving task calls MPI_Wait when it needs to use the
incoming message data
41
http://ib.cnea.gov.ar/~ipc/ptpde/mpi-class/3_pt2pt2.html
CSC 7600 Lecture 7 : MPI1 Spring 2011
Point to Point Communication
Non-blocking Calls
• When the system buffer is full, the blocking send would have to wait until the receiving task pulled some message data out of the buffer. Use of non-blocking call allows computation to be done during this interval, allowing for interleaving of computation and communication
• Non-blocking calls ensure that deadlock will not result
42
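A minimal sketch of this interleaving (the buffer names, partner rank, and compute routine are illustrative only):

MPI_Request request;
MPI_Status  status;

/* post the receive early so the incoming message has somewhere to land */
MPI_Irecv(recvbuf, count, MPI_DOUBLE, partner, tag, MPI_COMM_WORLD, &request);

do_local_computation();        /* useful work proceeds while the message is in flight */

MPI_Wait(&request, &status);   /* block only when the received data is actually needed */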
CSC 7600 Lecture 7 : MPI1 Spring 2011
43
Topics
• Introduction
• MPI Standard
• MPI-1 Model and Basic Calls
• MPI Communicators
• Point to Point Communication
• Point to Point Communication in-depth
• Deadlock
• Trapezoidal Rule : A Case Study
• Using MPI & Jumpshot to profile parallel applications
• Summary – Materials for the Test
CSC 7600 Lecture 7 : MPI1 Spring 2011
Deadlock
• Something to avoid
• A situation where the dependencies between processors
are cyclic
– One processor is waiting for a message from another processor,
but that processor is waiting for a message from the first, so
nothing happens
• Until your time in the queue runs out and your job is killed
• MPI does not have timeouts
44
CSC 7600 Lecture 7 : MPI1 Spring 2011
Deadlock Example
• If the message sizes are small enough, this should
work because of system buffers
• If the messages are too large, or system buffering is
not used, this will hang
If (rank == 0) {
err = MPI_Send(sendbuf, count, datatype, 1, tag, comm);
err = MPI_Recv(recvbuf, count, datatype, 1, tag, comm, &status);
}else {
err = MPI_Send(sendbuf, count, datatype, 0, tag, comm);
err = MPI_Recv(recvbuf, count, datatype, 0, tag, comm, &status);
}
45
CSC 7600 Lecture 7 : MPI1 Spring 2011
Deadlock Example Solutions
If (rank == 0) {
err = MPI_Send(sendbuf, count, datatype, 1, tag, comm);
err = MPI_Recv(recvbuf, count, datatype, 1, tag, comm, &status);
}else {
err = MPI_Recv(recvbuf, count, datatype, 0, tag, comm, &status);
err = MPI_Send(sendbuf, count, datatype, 0, tag, comm);
}
or
If (rank == 0) {

err = MPI_Isend(sendbuf, count, datatype, 1, tag, comm, &send_req);

err = MPI_Irecv(recvbuf, count, datatype, 1, tag, comm, &recv_req);  /* MPI_Irecv also requires a request handle */

err = MPI_Wait(&send_req, &status);

err = MPI_Wait(&recv_req, &status);

}else {

err = MPI_Isend(sendbuf, count, datatype, 0, tag, comm, &send_req);

err = MPI_Irecv(recvbuf, count, datatype, 0, tag, comm, &recv_req);

err = MPI_Wait(&send_req, &status);

err = MPI_Wait(&recv_req, &status);

}
46
CSC 7600 Lecture 7 : MPI1 Spring 2011
47
Topics
• Introduction
• MPI Standard
• MPI-1 Model and Basic Calls
• MPI Communicators
• Point to Point Communication
• Point to Point Communication in-depth
• Deadlock
• Trapezoidal Rule : A Case Study
• Using MPI & Jumpshot to profile parallel applications
• Summary – Materials for the Test
CSC 7600 Lecture 7 : MPI1 Spring 2011
Numerical Integration Using Trapezoidal
Rule: A Case Study
• In review, the 6 main MPI calls:
– MPI_Init
– MPI_Finalize
– MPI_Comm_size
– MPI_Comm_rank
– MPI_Send
– MPI_Recv
• Using these 6 MPI function calls we can begin to
construct several kinds of parallel applications
• In the following section we discuss how to use these 6
calls to parallelize Trapezoidal Rule
48
CSC 7600 Lecture 7 : MPI1 Spring 2011
Approximating Integrals: Definite Integral
• Problem : to find an approximate value to a definite
integral
• A definite integral from a to b of a non-negative function
f(x) can be thought of as the area bounded by the X-axis,
the vertical lines x=a and x=b, and the graph of f(x)
49
CSC 7600 Lecture 7 : MPI1 Spring 2011
Approximating Integrals : Trapezoidal Rule
• Approximating area under the curve can be
done by dividing the region under the curve into
regular geometric shapes and then adding the
areas of the shapes.
• In Trapezoidal Rule, the region between a and b
can be divided into n trapezoids of base
h = (b-a)/n
• The area of a trapezoid can be calculated as
• In the case of our function the area for the first
block can be represented as
• The area under the curve bounded by a & b can
be approximated as :
50
Area of a trapezoid with parallel sides b1, b2 and height h:
  h (b1 + b2) / 2

Area of the first block under the curve:
  h ( f(a) + f(a+h) ) / 2

Area under the curve bounded by a and b (here with n = 4 trapezoids):
    h ( f(a)      + f(a+h)  ) / 2
  + h ( f(a+h)    + f(a+2h) ) / 2
  + h ( f(a+2h)   + f(a+3h) ) / 2
  + h ( f(a+3h)   + f(b)    ) / 2
CSC 7600 Lecture 7 : MPI1 Spring 2011
Approximating Integrals: Trapezoid Rule
• We can further generalize this concept of approximation
of integrals as a summation of trapezoidal areas
51
T ≈ (h/2)[ f(x0) + f(x1) ] + (h/2)[ f(x1) + f(x2) ] + ... + (h/2)[ f(x_n-1) + f(x_n) ]
  = (h/2)[ f(x0) + f(x1) + f(x1) + f(x2) + ... + f(x_n-1) + f(x_n) ]
  = (h/2)[ f(x0) + 2 f(x1) + 2 f(x2) + ... + 2 f(x_n-1) + f(x_n) ]
  = h [ f(x0)/2 + f(x1) + f(x2) + ... + f(x_n-1) + f(x_n)/2 ]
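A quick sanity check (an added example, not on the original slide): approximating the integral of f(x) = x*x from 0 to 1 with n = 4 trapezoids gives h = 0.25 and
  T = h [ f(0)/2 + f(0.25) + f(0.5) + f(0.75) + f(1)/2 ]
    = 0.25 [ 0 + 0.0625 + 0.25 + 0.5625 + 0.5 ] = 0.34375,
close to the exact value 1/3; the error shrinks as n grows.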
CSC 7600 Lecture 7 : MPI1 Spring 2011
Trapezoidal Rule – Serial / Sequential
program in C
52
/* serial.c -- serial trapezoidal rule
 * Calculate definite integral using trapezoidal rule.
 * The function f(x) is hardwired.
 * Input: a, b, n.
 * Output: estimate of integral from a to b of f(x) using n trapezoids.
 * See Chapter 4, pp. 53 & ff. in PPMPI. */
#include <stdio.h>

main() {
    float integral;        /* Store result in integral   */
    float a, b;            /* Left and right endpoints   */
    int   n;               /* Number of trapezoids       */
    float h;               /* Trapezoid base width       */
    float x;
    int   i;
    float f(float x);      /* Function we're integrating */

    printf("Enter a, b, and n\n");
    scanf("%f %f %d", &a, &b, &n);

    h = (b-a)/n;
    integral = (f(a) + f(b))/2.0;
    x = a;
    for (i = 1; i <= n-1; i++) {
        x = x + h;
        integral = integral + f(x);
    }
    integral = integral*h;

    printf("With n = %d trapezoids, our estimate\n", n);
    printf("of the integral from %f to %f = %f\n", a, b, integral);
} /* main */

float f(float x) {
    float return_val;
    /* Calculate f(x). Store calculation in return_val. */
    return_val = x*x;
    return return_val;
} /* f */
Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4
CSC 7600 Lecture 7 : MPI1 Spring 2011
Results for the Serial Trapezoidal Rule
a | b  | n    | f(x) single precision | f(x) double precision
2 | 25 | 1    | 7233.500000           | 7233.500000
2 | 25 | 2    | 5712.625000           | 5712.625000
2 | 25 | 10   | 5225.945312           | 5225.945000
2 | 25 | 30   | 5207.916992           | 5207.919815
2 | 25 | 40   | 5206.934082           | 5206.934062
2 | 25 | 50   | 5206.475098           | 5206.477800
2 | 25 | 1000 | 5205.664551           | 5205.668694
53
Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4
CSC 7600 Lecture 7 : MPI1 Spring 2011
Parallelizing Trapezoidal Rule
• One way of parallelizing the Trapezoidal rule:
– Distribute chunks of the workload (each chunk characterized by its own subinterval of [a,b]) to each process
– Calculate the integral of f over each subinterval
– Finally, add the integrals calculated for all the subintervals to produce the result for the complete interval [a,b]
• Issues to consider
– The number of trapezoids (n) should be evenly divisible across the (p) processes (load balancing)
– The first process calculates the area for the first n/p trapezoids, the second process calculates the area for the next n/p trapezoids, and so on
• Key information related to the problem that each process needs:
– Rank of the process
– Ability to derive the workload per processor as a function of rank
Assumption: Process 0 does the summation
54
CSC 7600 Lecture 7 : MPI1 Spring 2011
Parallelizing Trapezoidal Rule
• Algorithm (assumption: the number of trapezoids n is evenly divisible across the p processors)
– Calculate:
– Each process calculates its own workload (interval to integrate)
• local number of trapezoids ( local_n) = n/p
• local starting point (local_a) = a+(process_rank *local_n* h)
• local ending point (local_b) = (local_a + local_n * h)
– Each process calculates its own integral for the local intervals
• For each of the local_n trapezoids calculate area
• Aggregate area for local_n trapezoids
– If PROCESS_RANK == 0
• Receive messages (containing sub-interval area aggregates) from all processors
• Aggregate (ADD) all sub-interval areas
– If PROCESS_RANK > 0
• Send sub-interval area to PROCESS_RANK(0)
Classic SPMD: all processes run the same program on different datasets.
55
h = (b - a) / n
CSC 7600 Lecture 7 : MPI1 Spring 2011
Parallel Trapezoidal Rule
56
#include <stdio.h>
#include "mpi.h"
main(int argc, char** argv) {int my_rank; /* My process rank */int p; /* The number of processes */float a = 0.0; /* Left endpoint */float b = 1.0; /* Right endpoint */int n = 1024; /* Number of trapezoids */float h; /* Trapezoid base length */float local_a; /* Left endpoint my process */float local_b; /* Right endpoint my process */int local_n; /* Number of trapezoids for my calculation */float integral; /* Integral over my interval */float total; /* Total integral */int source; /* Process sending integral */int dest = 0; /* All messages go to 0 */int tag = 0;MPI_Status status;
Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4
CSC 7600 Lecture 7 : MPI1 Spring 2011
Parallel Trapezoidal Rule
57
float Trap(float local_a, float local_b, int local_n,float h); /* Calculate local integral */
/* Let the system do what it needs to start up MPI */MPI_Init(&argc, &argv);
/* Get my process rank */MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
/* Find out how many processes are being used */MPI_Comm_size(MPI_COMM_WORLD, &p);
h = (b-a)/n; /* h is the same for all processes */local_n = n/p; /* So is the number of trapezoids */
/* Length of each process' interval of* integration = local_n*h. So my interval* starts at: */
local_a = a + my_rank*local_n*h;local_b = local_a + local_n*h;integral = Trap(local_a, local_b, local_n, h); Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4
CSC 7600 Lecture 7 : MPI1 Spring 2011
Parallel Trapezoidal Rule
58
/* Add up the integrals calculated by each process */if (my_rank == 0) {
total = integral;for (source = 1; source < p; source++) {
MPI_Recv(&integral, 1, MPI_FLOAT, source, tag,MPI_COMM_WORLD, &status);
total = total + integral;}
} else { MPI_Send(&integral, 1, MPI_FLOAT, dest,
tag, MPI_COMM_WORLD);}/* Print the result */if (my_rank == 0) {
printf("With n = %d trapezoids, our estimate\n",n);
printf("of the integral from %f to %f = %f\n",a, b, total);
}/* Shut down MPI */MPI_Finalize();
} /* main */ Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4
CSC 7600 Lecture 7 : MPI1 Spring 2011
Parallel Trapezoidal Rule
59
float Trap(float local_a /* in */,float local_b /* in */,int local_n /* in */,float h /* in */) {
float integral; /* Store result in integral */float x;int i;
float f(float x); /* function we're integrating */
integral = (f(local_a) + f(local_b))/2.0;x = local_a;for (i = 1; i <= local_n-1; i++) {
x = x + h;integral = integral + f(x);
}integral = integral*h;return integral;
} /* Trap */float f(float x) {
float return_val;/* Calculate f(x). *//* Store calculation in return_val. */return_val = x*x;return return_val;
} /* f */Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4
CSC 7600 Lecture 7 : MPI1 Spring 2011
Parallel Trapezoidal Rule
60
[cdekate@celeritas l7]$ mpiexec -n 8 … trapWith n = 1024 trapezoids, our estimateof the integral from 2.000000 to 25.000000 = 5205.667969Writing logfile....Finished writing logfile.
[cdekate@celeritas l7]$ ./serial Enter a, b, and n2 25 1024With n = 1024 trapezoids, our estimateof the integral from 2.000000 to 25.000000 = 5205.666016[cdekate@celeritas l7]$
CSC 7600 Lecture 7 : MPI1 Spring 2011
61
Topics
• Introduction
• MPI Standard
• MPI-1 Model and Basic Calls
• MPI Communicators
• Point to Point Communication
• Point to Point Communication in-depth
• Deadlock
• Trapezoidal Rule : A Case Study
• Using MPI & Jumpshot to profile parallel applications
• Summary – Materials for the Test
CSC 7600 Lecture 7 : MPI1 Spring 2011
Profiling Applications
• To Profile your parallel applications:1. Compile the applications with
mpicc -profile=mpe_mpilog -o trap trap.c
2. Run your applications using the standard procedure using PBS/mpirun
3. After your run is complete you might see lines like these in the stdout / output file of your PBS-based run:
Writing logfile....
Finished writing logfile.
4. You will also see a file with the extension “clog2”
5. E.g., if your executable was named “parallel_program” you would see a file named “parallel_program.clog2”
6. Convert the “clog2” file to “slog2” format by issuing the command “clog2TOslog2 parallel_program.clog2” (maintain the capitalization in the clog2TOslog2 command)
7. Step 6 will result in a parallel_program.slog2 file
8. Use Jumpshot to visualize this file
62
CSC 7600 Lecture 7 : MPI1 Spring 2011
Using Jumpshot
Note : You need Java Runtime Environment on your
machine in order to be able to run Jumpshot
Download your parallel_program.slog2 file from Arete
• Download Jumpshot from :
– ftp://ftp.mcs.anl.gov/pub/mpi/slog2/slog2rte.tar.gz
– Uncompress the tar.gz file to get a folder : slog2rte-1.2.6/
– In the slog2rte-1.2.6/lib/
type java -jar jumpshot.jar parallel_program.slog2
• Or click on the jumpshot_launcher.jar
• Open the file using the jumpshot
file menu
63
CSC 7600 Lecture 7 : MPI1 Spring 2011
64
Topics
• Introduction
• MPI Standard
• MPI-1.x Model and Basic Calls
• MPI Communicators
• Point to Point Communication
• Point to Point Communication in-depth
• Deadlock
• Trapezoidal Rule : A Case Study
• Using MPI & Jumpshot to profile parallel applications
• Summary – Materials for the Test
CSC 7600 Lecture 7 : MPI1 Spring 2011
Summary : Material for the Test
• Basic MPI – 10, 11, 12
• Communicators – 18, 19, 20
• Point to Point Communication – 24, 25, 26, 27, 28
• In-depth Point to Point Communication – 33, 34, 35, 36,
37, 38, 39, 40, 41, 42
• Deadlock – 44, 45, 46
65
CSC 7600 Lecture 8 : MPI2
Spring 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &
MEANS
MESSAGE PASSING INTERFACE MPI
(PART B)
Prof. Thomas SterlingDepartment of Computer ScienceLouisiana State UniversityFebruary 10, 2011
CSC 7600 Lecture 8 : MPI2
Spring 20112
Topics
• MPI Collective Calls: Synchronization Primitives
• MPI Collective Calls: Communication Primitives
• MPI Collective Calls: Reduction Primitives
• Derived Datatypes: Introduction
• Derived Datatypes: Contiguous
• Derived Datatypes: Vector
• Derived Datatypes: Indexed
• Derived Datatypes: Struct
• Matrix-Vector multiplication : A Case Study
• MPI Profiling calls
• Additional Topics
• Summary Materials for Test
CSC 7600 Lecture 8 : MPI2
Spring 2011
Review of Basic MPI Calls
• In review, the 6 main MPI calls:
– MPI_Init
– MPI_Finalize
– MPI_Comm_size
– MPI_Comm_rank
– MPI_Send
– MPI_Recv
• Include MPI Header file
– #include “mpi.h”
• Basic MPI Datatypes
– MPI_INT, MPI_FLOAT, ….
3
CSC 7600 Lecture 8 : MPI2
Spring 20114
Topics
• MPI Collective Calls: Synchronization Primitives
• MPI Collective Calls: Communication Primitives
• MPI Collective Calls: Reduction Primitives
• Derived Datatypes: Introduction
• Derived Datatypes: Contiguous
• Derived Datatypes: Vector
• Derived Datatypes: Indexed
• Derived Datatypes: Struct
• Matrix-Vector multiplication : A Case Study
• MPI Profiling calls
• Additional Topics
• Summary Materials for Test
CSC 7600 Lecture 8 : MPI2
Spring 2011
Collective Calls
• A communication pattern that encompasses all processes within a communicator is known as
collective communication
• MPI has several collective communication calls, the most frequently used are:– Synchronization
• Barrier
– Communication
• Broadcast
• Gather & Scatter
• All Gather
– Reduction
• Reduce
• AllReduce
5
CSC 7600 Lecture 8 : MPI2
Spring 20116
MPI Collective Calls : Barrier
Function: MPI_Barrier()
int MPI_Barrier (
MPI_Comm comm )
Description:Creates barrier synchronization in a
communicator group comm. Each process,
when reaching the MPI_Barrier call, blocks
until all the processes in the group reach the
same MPI_Barrier call.
http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Barrier.html
[Diagram: processes P0, P1, P2, P3 each block at MPI_Barrier() until all of them have reached the call, then all proceed]
CSC 7600 Lecture 8 : MPI2
Spring 2011
Example: MPI_Barrier()
7
#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    MPI_Barrier(MPI_COMM_WORLD);
    printf ("Hello world! Process %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
[cdekate@celeritas collective]$ mpirun -np 8 barrierHello world! Process 0 of 8 on celeritas.cct.lsu.eduWriting logfile....Finished writing logfile.Hello world! Process 4 of 8 on compute-0-3.localHello world! Process 1 of 8 on compute-0-0.localHello world! Process 3 of 8 on compute-0-2.localHello world! Process 6 of 8 on compute-0-5.localHello world! Process 7 of 8 on compute-0-6.localHello world! Process 5 of 8 on compute-0-4.localHello world! Process 2 of 8 on compute-0-1.local[cdekate@celeritas collective]$
CSC 7600 Lecture 8 : MPI2
Spring 20118
Topics
• MPI Collective Calls: Synchronization Primitives
• MPI Collective Calls: Communication Primitives
• MPI Collective Calls: Reduction Primitives
• Derived Datatypes: Introduction
• Derived Datatypes: Contiguous
• Derived Datatypes: Vector
• Derived Datatypes: Indexed
• Derived Datatypes: Struct
• Matrix-Vector multiplication : A Case Study
• MPI Profiling calls
• Additional Topics
• Summary Materials for Test
CSC 7600 Lecture 8 : MPI2
Spring 20119
MPI Collective Calls : Broadcast
Function: MPI_Bcast()
int MPI_Bcast (
void *message,
int count,
MPI_Datatype datatype,
int root,
MPI_Comm comm )
Description:A collective communication call where a single process sends the same data contained in the
message to every process in the communicator. By default a tree like algorithm is used to broadcast
the message to a block of processors, a linear algorithm is then used to broadcast the message from
the first process in a block to all other processes. All the processes invoke the MPI_Bcast call with the
same arguments for root and comm.
float endpoint[2]; ...
MPI_Bcast(endpoint, 2, MPI_FLOAT, 0, MPI_COMM_WORLD);...
http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Bcast.html
[Diagram: Broadcast. Before, only root P0 holds A; after, P0, P1, P2, P3 each hold A]
CSC 7600 Lecture 8 : MPI2
Spring 201110
MPI Collective Calls : Scatter
Function: MPI_Scatter()
int MPI_Scatter (
void *sendbuf,
int send_count,
MPI_Datatype send_type,
void *recvbuf,
int recv_count,
MPI_Datatype recv_type,
int root,
MPI_Comm comm)
Description: MPI_Scatter splits the data referenced by sendbuf on the process with rank root into p segments, each
of which consists of send_count elements of type send_type. The first segment is sent to process 0, the second
segment to process 1, and so on. The send arguments are significant only on the process with rank root.
...
MPI_Scatter(&(local_A[0][0]), n/p, MPI_FLOAT, row_segment, n/p, MPI_FLOAT, 0,MPI_COMM_WORLD);
...
http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Scatter.html
[Diagram: Scatter. Before, root P0 holds A, B, C, D in local_A[][]; after, P0, P1, P2, P3 hold A, B, C, D respectively in row_segment]
CSC 7600 Lecture 8 : MPI2
Spring 201111
MPI Collective Calls : Gather
Function: MPI_Gather()
int MPI_Gather (
void *sendbuf,
int send_count,
MPI_Datatype sendtype,
void *recvbuf,
int recvcount,
MPI_Datatype recvtype,
int root,
MPI_Comm comm )
Description: MPI_Gather collects the data referenced by sendbuf from each process in the communicator comm, and
stores the data in process rank order on the process with rank root, in the location referenced by
recvbuf. The recv parameters are significant only on the process with rank root.
...
MPI_Gather(local_x, n/p, MPI_FLOAT, global_x, n/p, MPI_FLOAT, 0, MPI_COMM_WORLD);
...
http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Gather.html
[Diagram: Gather. Before, P0, P1, P2, P3 hold A, B, C, D in local_x; after, root P0 holds A, B, C, D in global_x]
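A small end-to-end illustration (a hedged sketch added here, not from the slides) combining the two calls above: rank 0 scatters an array in equal chunks, every process squares its chunk, and the results are gathered back. The array sizes assume at most 16 processes:

#include <stdio.h>
#include "mpi.h"
#define CHUNK 4
int main(int argc, char **argv) {
    int rank, p, i;
    float send[64], local[CHUNK], recv[64];   /* assumes p <= 16 */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    if (rank == 0)                            /* only root fills the send buffer */
        for (i = 0; i < CHUNK * p; i++) send[i] = (float) i;
    MPI_Scatter(send, CHUNK, MPI_FLOAT, local, CHUNK, MPI_FLOAT, 0, MPI_COMM_WORLD);
    for (i = 0; i < CHUNK; i++) local[i] = local[i] * local[i];
    MPI_Gather(local, CHUNK, MPI_FLOAT, recv, CHUNK, MPI_FLOAT, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("last element squared = %f\n", recv[CHUNK * p - 1]);
    MPI_Finalize();
    return 0;
}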
CSC 7600 Lecture 8 : MPI2
Spring 201112
MPI Collective Calls : All Gather
Function: MPI_Allgather()
int MPI_Allgather (
void *sendbuf,
int send_count,
MPI_Datatype sendtype,
void *recvbuf,
int recvcount,
MPI_Datatype recvtype,
MPI_Comm comm )
Description: MPI_Allgather gathers the content of the send buffer (sendbuf) on each process. The effect of this call
is similar to executing MPI_Gather() p times, each time with a different process acting as the root:
for (root = 0; root < p; root++)
    MPI_Gather(local_x, n/p, MPI_FLOAT, global_x, n/p, MPI_FLOAT, root, MPI_COMM_WORLD);
...
CAN BE REPLACED WITH :
MPI_Allgather(local_x, local_n, MPI_FLOAT, global_x, local_n, MPI_FLOAT, MPI_COMM_WORLD);
http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Allgather.html
[Diagram: All Gather. Before, P0, P1, P2, P3 hold A, B, C, D; after, every process holds A, B, C, D]
CSC 7600 Lecture 8 : MPI2
Spring 201113
Topics
• MPI Collective Calls: Synchronization Primitives
• MPI Collective Calls: Communication Primitives
• MPI Collective Calls: Reduction Primitives
• Derived Datatypes: Introduction
• Derived Datatypes: Contiguous
• Derived Datatypes: Vector
• Derived Datatypes: Indexed
• Derived Datatypes: Struct
• Matrix-Vector multiplication : A Case Study
• MPI Profiling calls
• Additional Topics
• Summary Materials for Test
CSC 7600 Lecture 8 : MPI2
Spring 201114
MPI Collective Calls : ReduceFunction: MPI_Reduce()
int MPI_Reduce (
void *operand,
void *result,
int count,
MPI_Datatype datatype,
MPI_Op operator,
int root,
MPI_Comm comm )
Description: A collective communication call where all the processes in a communicator contribute data that is
combined using a binary operation (MPI_Op) such as addition, max, min, logical and, etc. MPI_Reduce
combines the operands stored in the memory referenced by operand using the operation operator and
stores the result in *result. MPI_Reduce is called by all the processes in the communicator comm, and on
each of them count, datatype, operator, and root must be the same....
MPI_Reduce(&local_integral, &integral, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);...
http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Reduce.html
[Diagram: Reduce with binary op MPI_SUM. Before, P0, P1, P2, P3 hold A, B, C, D; after, root P0 holds A+B+C+D]
CSC 7600 Lecture 8 : MPI2
Spring 2011
MPI Binary Operations
15
• MPI binary operators are used in the MPI_Reduce function call as one of the
parameters. MPI_Reduce performs a global reduction operation (dictated by the MPI
binary operator parameter) on the supplied operands.
• Some of the common MPI Binary Operators used are :
Operation Name | Meaning
MPI_MAX        | Maximum
MPI_MIN        | Minimum
MPI_SUM        | Sum
MPI_PROD       | Product
MPI_LAND       | Logical And
MPI_BAND       | Bitwise And
MPI_LOR        | Logical Or
MPI_BOR        | Bitwise Or
MPI_LXOR       | Logical XOR
MPI_BXOR       | Bitwise XOR
MPI_MAXLOC     | Maximum and location of max.
MPI_MINLOC     | Minimum and location of min.
MPI_Reduce(&local_integral,
&integral, 1, MPI_FLOAT,
MPI_SUM, 0, MPI_COMM_WORLD);
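As an added illustration (a hedged sketch, not from the slides) of the location operators: reducing a value/rank pair with the predefined MPI_FLOAT_INT datatype and MPI_MAXLOC reports both the global maximum and which rank holds it; local_max and my_rank are assumed to have been computed earlier:

struct { float val; int rank; } in, out;
in.val  = local_max;      /* locally computed result (assumed) */
in.rank = my_rank;
MPI_Reduce(&in, &out, 1, MPI_FLOAT_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
if (my_rank == 0)
    printf("global max = %f on rank %d\n", out.val, out.rank);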
CSC 7600 Lecture 8 : MPI2
Spring 201116
MPI Collective Calls : All Reduce
Function: MPI_Allreduce()
int MPI_Allreduce (
void *sendbuf,
void *recvbuf,
int count,
MPI_Datatype datatype,
MPI_Op op,
MPI_Comm comm )
Description:MPI_Allreduce is used exactly like MPI_Reduce,
except that the result of the reduction is returned
on all processes, as a result there is no root
parameter....
MPI_Allreduce(&integral, &integral, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);...
http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Allreduce.html
[Diagram: All Reduce with binary op MPI_SUM. Before, P0, P1, P2, P3 hold A, B, C, D; after, every process holds A+B+C+D]
CSC 7600 Lecture 8 : MPI2
Spring 2011
Parallel Trapezoidal Rule
Send, Recv
17
#include <stdio.h>
#include "mpi.h"
main(int argc, char** argv) {int my_rank; /* My process rank */int p; /* The number of processes */float a = 0.0; /* Left endpoint */float b = 1.0; /* Right endpoint */int n = 1024; /* Number of trapezoids */float h; /* Trapezoid base length */float local_a; /* Left endpoint my process */float local_b; /* Right endpoint my process */int local_n; /* Number of trapezoids for my calculation */float integral; /* Integral over my interval */float total; /* Total integral */int source; /* Process sending integral */int dest = 0; /* All messages go to 0 */int tag = 0;MPI_Status status;
Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4
float Trap(float local_a, float local_b, int local_n,float h); /* Calculate local integral */
/* Let the system do what it needs to start up MPI */MPI_Init(&argc, &argv);
/* Get my process rank */MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
/* Find out how many processes are being used */MPI_Comm_size(MPI_COMM_WORLD, &p);
h = (b-a)/n; /* h is the same for all processes */local_n = n/p; /* So is the number of trapezoids */
/* Length of each process' interval of* integration = local_n*h. So my interval* starts at: */local_a = a + my_rank*local_n*h;local_b = local_a + local_n*h;integral = Trap(local_a, local_b, local_n, h);
CSC 7600 Lecture 8 : MPI2
Spring 2011
Parallel Trapezoidal Rule
Send, Recv
18
if (my_rank == 0) {total = integral;for (source = 1; source < p; source++) {
MPI_Recv(&integral, 1, MPI_FLOAT, source, tag,MPI_COMM_WORLD, &status);
total = total + integral;}
} else {
MPI_Send(&integral, 1, MPI_FLOAT, dest,tag, MPI_COMM_WORLD);
}if (my_rank == 0) {
printf("With n = %d trapezoids, our estimate\n",n);
printf("of the integral from %f to %f = %f\n",a, b, total);
}MPI_Finalize();} /* main */
Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4
CSC 7600 Lecture 8 : MPI2
Spring 2011
Parallel Trapezoidal Rule
Send, Recv
19
float Trap(float local_a /* in */,float local_b /* in */,int local_n /* in */,float h /* in */) {
float integral; /* Store result in integral */float x;int i;
float f(float x); /* function we're integrating */
integral = (f(local_a) + f(local_b))/2.0;x = local_a;for (i = 1; i <= local_n-1; i++) {
x = x + h;integral = integral + f(x);
}integral = integral*h;return integral;
} /* Trap */float f(float x) {
float return_val;/* Calculate f(x). *//* Store calculation in return_val. */return_val = x*x;return return_val;
} /* f */Trapezoidal Example Adapted from Parallel Programming in MPI P.Pacheco Ch 4
CSC 7600 Lecture 8 : MPI2
Spring 201120
Flowchart for Parallel Trapezoidal Rule
MASTER and WORKERS (SPMD, identical steps on every process):
  Initialize MPI environment -> Create local workload buffer (variables etc.) -> Isolate work regions -> Calculate the sequential trapezoid rule for the local region (calculate the local integral)
WORKERS: Send the local result to the "master"
MASTER: Recv. results from the "workers" -> Integrate results for the local workload -> Concatenate results to file -> End
CSC 7600 Lecture 8 : MPI2
Spring 2011
Trapezoidal Rule :
with MPI_Bcast, MPI_Reduce
21
#include <stdio.h>#include <stdlib.h>
/* We'll be using MPI routines, definitions, etc. */#include "mpi.h"
main(int argc, char** argv) {int my_rank; /* My process rank */int p; /* The number of processes */float endpoint[2]; /* Left and right */int n = 1024; /* Number of trapezoids */float h; /* Trapezoid base length */float local_a; /* Left endpoint my process */float local_b; /* Right endpoint my process */int local_n; /* Number of trapezoids for */
/* my calculation */float integral; /* Integral over my interval */float total; /* Total integral */int source; /* Process sending integral */int dest = 0; /* All messages go to 0 */int tag = 0;MPI_Status status;
float Trap(float local_a, float local_b, int local_n,float h); /* Calculate local integral */
MPI_Init(&argc, &argv);MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);MPI_Comm_size(MPI_COMM_WORLD, &p);
if (argc != 3) {if (my_rank==0)
printf("Usage: mpirun -np <numprocs> trapezoid <left> <right>\n");
MPI_Finalize();exit(0);
}
if (my_rank==0) {endpoint[0] = atof(argv[1]); /* left endpoint */endpoint[1] = atof(argv[2]); /* right endpoint */
}
MPI_Bcast(endpoint, 2, MPI_FLOAT, 0, MPI_COMM_WORLD);
CSC 7600 Lecture 8 : MPI2
Spring 2011
Trapezoidal Rule :
with MPI_Bcast, MPI_Reduce
22
h = (endpoint[1]-endpoint[0])/n; /* h is the same for all processes */
local_n = n/p; /* so is the number of trapezoids */if (my_rank == 0) printf("a=%f, b=%f, Local number of
trapezoids=%d\n", endpoint[0], endpoint[1], local_n );
local_a = endpoint[0] + my_rank*local_n*h;local_b = local_a + local_n*h;integral = Trap(local_a, local_b, local_n, h);
MPI_Reduce(&integral, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
if (my_rank == 0) {printf("With n = %d trapezoids, our estimate\n",
n);printf("of the integral from %f to %f = %f\n",
endpoint[0], endpoint[1], total);}
MPI_Finalize();} /* main */
float Trap(float local_a /* in */,float local_b /* in */,int local_n /* in */,float h /* in */) {
float integral; /* Store result in integral */float x;int i;
float f(float x); /* function we're integrating */
integral = (f(local_a) + f(local_b))/2.0;x = local_a;for (i = 1; i <= local_n-1; i++) {
x = x + h;integral = integral + f(x);
}integral = integral*h;return integral;
} /* Trap */
float f(float x) {float return_val;/* Calculate f(x). *//* Store calculation in return_val. */return_val = x*x;return return_val;
} /* f */
CSC 7600 Lecture 8 : MPI2
Spring 2011
Trapezoidal Rule :
with MPI_Bcast, MPI_Reduce
23
#!/bin/bash
#PBS -N name
#PBS -l walltime=120:00:00,nodes=2:ppn=4
cd /home/lsu00/Demos/l9/trapBcast
pwd
date
PROCS=`wc -l < $PBS_NODEFILE`
mpdboot --file=$PBS_NODEFILE
/usr/lib64/mpich2/bin/mpiexec -n $PROCS ./trapBcast 2 25 >>out.txt
mpdallexit
date
CSC 7600 Lecture 8 : MPI2
Spring 201124
Topics
• MPI Collective Calls: Synchronization Primitives
• MPI Collective Calls: Communication Primitives
• MPI Collective Calls: Reduction Primitives
• Derived Datatypes: Introduction
• Derived Datatypes: Contiguous
• Derived Datatypes: Vector
• Derived Datatypes: Indexed
• Derived Datatypes: Struct
• Matrix-Vector multiplication : A Case Study
• MPI Profiling calls
• Additional Topics
• Summary Materials for Test
CSC 7600 Lecture 8 : MPI2
Spring 2011
Constructing Datatypes
• Creating data structures in C:
typedef struct {
    . . .
} STRUCT_NAME;
• For example: in the numerical integration by trapezoidal rule we could create a data structure for storing the attributes of the problem as follows:
typedef struct {
    float a;
    float b;
    int   n;
} DATA_INTEGRAL;
. . .
DATA_INTEGRAL intg_data;
• What would happen when you use:
MPI_Bcast( &intg_data, 1, DATA_INTEGRAL, 0, MPI_COMM_WORLD);
25
ERROR!!! intg_data is of type DATA_INTEGRAL, and DATA_INTEGRAL is NOT an MPI_Datatype
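One way out, shown here only as a hedged sketch (it is not on the original slide, and the struct constructor it uses is introduced later in this lecture), is to describe DATA_INTEGRAL to MPI as a derived datatype and broadcast that:

#include <stddef.h>   /* offsetof */
MPI_Datatype mpi_intg;
int          blockcounts[2] = { 2, 1 };                      /* two floats, then one int */
MPI_Aint     offsets[2]     = { offsetof(DATA_INTEGRAL, a),
                                offsetof(DATA_INTEGRAL, n) };
MPI_Datatype oldtypes[2]    = { MPI_FLOAT, MPI_INT };
MPI_Type_struct(2, blockcounts, offsets, oldtypes, &mpi_intg);
MPI_Type_commit(&mpi_intg);
MPI_Bcast(&intg_data, 1, mpi_intg, 0, MPI_COMM_WORLD);       /* now legal */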
CSC 7600 Lecture 8 : MPI2
Spring 2011
Constructing MPI Datatypes
• MPI allows users to define derived MPI datatypes, built at execution time from the basic datatypes
• These derived datatypes can be used in the MPI communication calls, instead of the basic predefined datatypes.
• A sending process can pack noncontiguous data into a contiguous buffer and send the buffered data to a receiving process, which can unpack the contiguous buffer and store the data to noncontiguous locations.
• A derived datatype is an opaque object that specifies :– A sequence of primitive datatypes
– A sequence of integer (byte) displacements
• MPI has several functions for constructing derived datatypes :– Contiguous
– Vector
– Indexed
– Struct
26
CSC 7600 Lecture 8 : MPI2
Spring 2011
MPI : Basic Data Types
(Review)
MPI datatype C datatype
MPI_CHAR signed char
MPI_SHORT signed short int
MPI_INT signed int
MPI_LONG signed long int
MPI_UNSIGNED_CHAR unsigned char
MPI_UNSIGNED_SHORT unsigned short int
MPI_UNSIGNED unsigned int
MPI_UNSIGNED_LONG unsigned long int
MPI_FLOAT float
MPI_DOUBLE double
MPI_LONG_DOUBLE long double
MPI_BYTE
MPI_PACKED
27
You can also define your own (derived datatypes), such as an array of ints of size 100, or more complex examples, such as a struct or an array of structs
CSC 7600 Lecture 8 : MPI2
Spring 201128
Topics
• MPI Collective Calls: Synchronization Primitives
• MPI Collective Calls: Communication Primitives
• MPI Collective Calls: Reduction Primitives
• Derived Datatypes: Introduction
• Derived Datatypes: Contiguous
• Derived Datatypes: Vector
• Derived Datatypes: Indexed
• Derived Datatypes: Struct
• Matrix-Vector multiplication : A Case Study
• MPI Profiling calls
• Additional Topics
• Summary Materials for Test
CSC 7600 Lecture 8 : MPI2
Spring 201129
Derived Datatypes : Contiguous
Function: MPI_Type_contiguous()
int MPI_Type_contiguous(
int count,
MPI_Datatype old_type,
MPI_Datatype *new_type)
Description:This is the simplest constructor in the MPI derived datatypes. Contiguous datatype constructors create a new datatype by making count copies of existing data type (old_type)
MPI_Datatype rowtype;...
MPI_Type_contiguous(SIZE, MPI_FLOAT, &rowtype);
MPI_Type_commit(&rowtype);...
http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Type_contiguous.html
CSC 7600 Lecture 8 : MPI2
Spring 2011
Example : Derived Datatypes - Contiguous
30
#include "mpi.h"#include <stdio.h>
#define SIZE 4
int main(argc,argv)int argc;char *argv[]; {int numtasks, rank, source=0, dest, tag=1, i;MPI_Request req;
float a[SIZE][SIZE] ={1.0, 2.0, 3.0, 4.0,5.0, 6.0, 7.0, 8.0,9.0, 10.0, 11.0, 12.0,13.0, 14.0, 15.0, 16.0};
float b[SIZE];
MPI_Status stat;
MPI_Datatype rowtype;
MPI_Init(&argc,&argv);MPI_Comm_rank(MPI_COMM_WORLD, &rank);MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Type_contiguous(SIZE, MPI_FLOAT, &rowtype);MPI_Type_commit(&rowtype);if (numtasks == SIZE) {if (rank == 0) {
for (i=0; i<numtasks; i++){dest = i;
MPI_Isend(&a[i][0], 1, rowtype, dest, tag, MPI_COMM_WORLD, &req);
}}
MPI_Recv(b, SIZE, MPI_FLOAT, source, tag, MPI_COMM_WORLD, &stat);printf(“rank= %d b= %3.1f %3.1f %3.1f %3.1f\n”,
rank,b[0],b[1],b[2],b[3]);}
elseprintf(“Must specify %d processors. Terminating.\n”,SIZE);
MPI_Type_free(&rowtype);MPI_Finalize();}
Declares a 4x4 array of datatype float:
 1.0  2.0  3.0  4.0
 5.0  6.0  7.0  8.0
 9.0 10.0 11.0 12.0
13.0 14.0 15.0 16.0
Homogeneous data structure of size 4 (type: rowtype); each process i receives row i of the matrix.
https://computing.llnl.gov/tutorials/mpi/
CSC 7600 Lecture 8 : MPI2
Spring 2011
Example : Derived Datatypes - Contiguous
31
https://computing.llnl.gov/tutorials/mpi/
CSC 7600 Lecture 8 : MPI2
Spring 201132
Topics
• MPI Collective Calls: Synchronization Primitives
• MPI Collective Calls: Communication Primitives
• MPI Collective Calls: Reduction Primitives
• Derived Datatypes: Introduction
• Derived Datatypes: Contiguous
• Derived Datatypes: Vector
• Derived Datatypes: Indexed
• Derived Datatypes: Struct
• Matrix-Vector multiplication : A Case Study
• MPI Profiling calls
• Additional Topics
• Summary Materials for Test
CSC 7600 Lecture 8 : MPI2
Spring 201133
Derived Datatypes : Vector
Function: MPI_Type_vector()
int MPI_Type_vector(
int count,
int blocklen,
int stride,
MPI_Datatype old_type,
MPI_Datatype *newtype )
Description:
Returns a new datatype that represents equally spaced blocks. The spacing between the start of each block is given in units of extent (oldtype). The count represents the number of blocks, blocklen details the number of elements in each block, stride represents the number of elements between start of each block of the old_type. The new datatype is stored in new_type
...
MPI_Type_vector(SIZE, 1, SIZE, MPI_FLOAT, &columntype);...
http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Type_vector.html
CSC 7600 Lecture 8 : MPI2
Spring 2011
Example : Derived Datatypes - Vector
34
#include "mpi.h"#include <stdio.h>#define SIZE 4
int main(argc,argv)int argc;char *argv[]; {int numtasks, rank, source=0, dest, tag=1, i;MPI_Request req;float a[SIZE][SIZE] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0,
13.0, 14.0, 15.0, 16.0};float b[SIZE];
MPI_Status stat;MPI_Datatype columntype;
MPI_Init(&argc,&argv);MPI_Comm_rank(MPI_COMM_WORLD, &rank);MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Type_vector(SIZE, 1, SIZE, MPI_FLOAT, &columntype);MPI_Type_commit(&columntype);
if (numtasks == SIZE) {if (rank == 0) {
for (i=0; i<numtasks; i++)
MPI_Isend(&a[0][i], 1, columntype, i, tag, MPI_COMM_WORLD, &req);
}
MPI_Recv(b, SIZE, MPI_FLOAT, source, tag, MPI_COMM_WORLD, &stat);printf("rank= %d b= %3.1f %3.1f %3.1f %3.1f\n",
rank,b[0],b[1],b[2],b[3]);}
elseprintf("Must specify %d processors. Terminating.\n",SIZE);
MPI_Type_free(&columntype);MPI_Finalize();}
https://computing.llnl.gov/tutorials/mpi/
Declares a 4x4 array of datatype float:
 1.0  2.0  3.0  4.0
 5.0  6.0  7.0  8.0
 9.0 10.0 11.0 12.0
13.0 14.0 15.0 16.0
Homogeneous data structure of size 4 (type: columntype); each process i receives column i:
P0: 1.0 5.0 9.0 13.0   P1: 2.0 6.0 10.0 14.0   P2: 3.0 7.0 11.0 15.0   P3: 4.0 8.0 12.0 16.0
CSC 7600 Lecture 8 : MPI2
Spring 2011
Example : Derived Datatypes - Vector
35
https://computing.llnl.gov/tutorials/mpi/
CSC 7600 Lecture 8 : MPI2
Spring 201136
Topics
• MPI Collective Calls: Synchronization Primitives
• MPI Collective Calls: Communication Primitives
• MPI Collective Calls: Reduction Primitives
• Derived Datatypes: Introduction
• Derived Datatypes: Contiguous
• Derived Datatypes: Vector
• Derived Datatypes: Indexed
• Derived Datatypes: Struct
• Matrix-Vector multiplication : A Case Study
• MPI Profiling calls
• Additional Topics
• Summary Materials for Test
CSC 7600 Lecture 8 : MPI2
Spring 201137
Derived Datatypes : Indexed
Function: MPI_Type_indexed()
int MPI_Type_indexed(
int count,
int *array_of_blocklengths,
int *array_of_displacements,
MPI_Datatype oldtype,
MPI_datatype *newtype);
Description:
Returns a new datatype that represents count blocks. Each block is defined by an entry in array_of_blocklengths and
array_of_displacements. Displacements are expressed in units of extent(oldtype). The count is the number of blocks
and the number of entries in array_of_displacements (displacement of each block in units of the oldtype) and
array_of_blocklengths (number of instances of oldtype in each block).
...
MPI_Type_indexed(2, blocklengths, displacements, MPI_FLOAT,
&indextype);...
https://computing.llnl.gov/tutorials/mpi/man/MPI_Type_indexed.txt
CSC 7600 Lecture 8 : MPI2
Spring 2011
Example : Derived Datatypes - Indexed
38
#include "mpi.h"#include <stdio.h>#define NELEMENTS 6
int main(argc,argv)int argc;char *argv[]; {int numtasks, rank, source=0, dest, tag=1, i;MPI_Request req;int blocklengths[2], displacements[2];
float a[16] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0};
float b[NELEMENTS];
MPI_Status stat;MPI_Datatype indextype;
MPI_Init(&argc,&argv);MPI_Comm_rank(MPI_COMM_WORLD, &rank);MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
blocklengths[0] = 4;blocklengths[1] = 2;displacements[0] = 5;displacements[1] = 12;
MPI_Type_indexed(2, blocklengths, displacements, MPI_FLOAT, &indextype);MPI_Type_commit(&indextype);
if (rank == 0) {for (i=0; i<numtasks; i++)
MPI_Isend(a, 1, indextype, i, tag, MPI_COMM_WORLD, &req);}
MPI_Recv(b, NELEMENTS, MPI_FLOAT, source, tag, MPI_COMM_WORLD, &stat);printf("rank= %d b= %3.1f %3.1f %3.1f %3.1f %3.1f %3.1f\n",
rank,b[0],b[1],b[2],b[3],b[4],b[5]);
MPI_Type_free(&indextype);MPI_Finalize();}
https://computing.llnl.gov/tutorials/mpi/
Declares a [16][1] array of type float:
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0 15.0 16.0
Creates a new datatype indextype selecting the elements 6.0 7.0 8.0 9.0 (block of 4 at displacement 5) and 13.0 14.0 (block of 2 at displacement 12)
CSC 7600 Lecture 8 : MPI2
Spring 2011
Example : Derived Datatypes - Indexed
39
https://computing.llnl.gov/tutorials/mpi/
CSC 7600 Lecture 8 : MPI2
Spring 201140
Topics
• MPI Collective Calls: Synchronization Primitives
• MPI Collective Calls: Communication Primitives
• MPI Collective Calls: Reduction Primitives
• Derived Datatypes: Introduction
• Derived Datatypes: Contiguous
• Derived Datatypes: Vector
• Derived Datatypes: Indexed
• Derived Datatypes: Struct
• Matrix-Vector multiplication : A Case Study
• MPI Profiling calls
• Additional Topics
• Summary Materials for Test
CSC 7600 Lecture 8 : MPI2
Spring 201141
Derived Datatypes : struct
Function: MPI_Type_struct()
int MPI_Type_struct(
int count,
int *array_of_blocklengths,
MPI_Aint *array_of_displacements,
MPI_Datatype *array_of_types,
MPI_datatype *newtype);
Description:
Returns a new datatype that represents count blocks. Each block is defined by an entry in array_of_blocklengths,
array_of_displacements, and array_of_types. Displacements are expressed in bytes. count is an integer that specifies
the number of blocks (and the number of entries in each array). array_of_blocklengths gives the number of elements in each
block, array_of_displacements specifies the byte displacement of each block, and array_of_types gives the datatype of
the elements making up each block.
...
MPI_Type_struct(2, blockcounts, offsets, oldtypes, &particletype);...
https://computing.llnl.gov/tutorials/mpi/man/MPI_Type_struct.txt
CSC 7600 Lecture 8 : MPI2
Spring 2011
Example : Derived Datatype - struct
42
#include "mpi.h"#include <stdio.h>#define NELEM 25int main(argc,argv)int argc;char *argv[]; {int numtasks, rank, source=0, dest, tag=1, I;typedef struct {float x, y, z;float velocity;int n, type;} Particle;
Particle p[NELEM], particles[NELEM];MPI_Datatype particletype, oldtypes[2]; int blockcounts[2];
/* MPI_Aint type used to be consistent with syntax of *//* MPI_Type_extent routine */MPI_Aint offsets[2], extent;
MPI_Status stat;
MPI_Init(&argc,&argv);MPI_Comm_rank(MPI_COMM_WORLD, &rank);MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
/* Setup description of the 4 MPI_FLOAT fields x, y, z, velocity */offsets[0] = 0;oldtypes[0] = MPI_FLOAT;blockcounts[0] = 4;
MPI_Type_extent(MPI_FLOAT, &extent);offsets[1] = 4 * extent;oldtypes[1] = MPI_INT;blockcounts[1] = 2;MPI_Type_struct(2, blockcounts, offsets, oldtypes, &particletype);MPI_Type_commit(&particletype);
if (rank == 0) {for (i=0; i<NELEM; i++) {
particles[i].x = i * 1.0;particles[i].y = i * -1.0;particles[i].z = i * 1.0; particles[i].velocity = 0.25;particles[i].n = i;particles[i].type = i % 2;
}for (i=0; i<numtasks; i++)
MPI_Send(particles, NELEM, particletype, i, tag, MPI_COMM_WORLD);}
MPI_Recv(p, NELEM, particletype, source, tag, MPI_COMM_WORLD, &stat);printf("rank= %d %3.2f %3.2f %3.2f %3.2f %d %d\n", rank,p[3].x,
p[3].y,p[3].z,p[3].velocity,p[3].n,p[3].type);
MPI_Type_free(&particletype);MPI_Finalize();}
https://computing.llnl.gov/tutorials/mpi/
Declaring the structure of the heterogeneous datatype Float, Float, Float, Float, Int, Int
Construct the heterogeneous datatype as an MPI datatype using Struct
Populate the heterogenous MPI datatype with heterogeneous data
CSC 7600 Lecture 8 : MPI2
Spring 201143
https://computing.llnl.gov/tutorials/mpi/
Example : Derived Datatype - struct
CSC 7600 Lecture 8 : MPI2
Spring 201144
Topics
• MPI Collective Calls: Synchronization Primitives
• MPI Collective Calls: Communication Primitives
• MPI Collective Calls: Reduction Primitives
• Derived Datatypes: Introduction
• Derived Datatypes: Contiguous
• Derived Datatypes: Vector
• Derived Datatypes: Indexed
• Derived Datatypes: Struct
• Matrix-Vector multiplication : A Case Study
• MPI Profiling calls
• Additional Topics
• Summary Materials for Test
CSC 7600 Lecture 8 : MPI2
Spring 201145
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.
Matrix Vector Multiplication
CSC 7600 Lecture 8 : MPI2
Spring 201146
Matrix Vector Multiplication
where A is an n x m matrix and B is a vector of size m and C is a vector of size n.
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.
Multiplication of a matrix A and a vector B produces the vector C whose elements, c_i (0 <= i < n), are computed as follows:
c_i = Σ (k = 0 .. m-1) A_ik * b_k
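For reference, a hedged serial sketch (added here, not part of the original slides) of the computation that the parallel version on the following slides distributes across workers:

/* c = A x b for an n x m matrix A and a vector b of length m */
for (i = 0; i < n; i++) {
    c[i] = 0.0;
    for (k = 0; k < m; k++)
        c[i] = c[i] + A[i][k] * b[k];
}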
CSC 7600 Lecture 8 : MPI2
Spring 201147
Matrix-Vector Multiplication: c = A x b
CSC 7600 Lecture 8 : MPI2
Spring 2011
Example: Matrix-Vector Multiplication
(DEMO)
48
#include "mpi.h"#include <stdio.h>#include <stdlib.h>
#define NRA 4 /* number of rows in matrix A */#define NCA 4 /* number of columns in matrix A */#define NCB 1 /* number of columns in matrix B */#define MASTER 0 /* taskid of first task */#define FROM_MASTER 1 /* setting a message type */#define FROM_WORKER 2 /* setting a message type */
int main (int argc, char *argv[]){int numtasks, /* number of tasks in partition */
taskid, /* a task identifier */numworkers, /* number of worker tasks */source, /* task id of message source */dest, /* task id of message destination */mtype, /* message type */rows, /* rows of matrix A sent to each worker */averow, extra, offset, /* used to determine rows sent to each worker */i, j, k, rc; /* misc */
Define the dimensions of the Matrix a([4][4]) and Vector b([4][1])
CSC 7600 Lecture 8 : MPI2
Spring 201149
double a[NRA][NCA], /* Matrix A to be multiplied */b[NCA][NCB], /* Vector B to be multiplied */c[NRA][NCB]; /* result Vector C */
MPI_Status status;
MPI_Init(&argc,&argv);MPI_Comm_rank(MPI_COMM_WORLD,&taskid);MPI_Comm_size(MPI_COMM_WORLD,&numtasks);if (numtasks < 2 ) {printf("Need at least two MPI tasks. Quitting...\n");
MPI_Abort(MPI_COMM_WORLD, rc);exit(1);}
numworkers = numtasks-1;/**************************** master task ************************************/
if (taskid == MASTER){
printf("mpi_mm has started with %d tasks.\n",numtasks);printf("Initializing arrays...\n");
for (i=0; i<NRA; i++)for (j=0; j<NCA; j++)
a[i][j]= i+j;for (i=0; i<NCA; i++)
for (j=0; j<NCB; j++)b[i][j]= (i+1)*(j+1);
Example: Matrix-Vector Multiplication
Declare the matrix , vector to be multiplied and the resultant vector
MASTER Initializes the Matrix A :0.00 1.00 2.00 3.00 1.00 2.00 3.00 4.00 2.00 3.00 4.00 5.00 3.00 4.00 5.00 6.00
MASTER Initializes B :1.00 2.00 3.00 4.00
CSC 7600 Lecture 8 : MPI2
Spring 201150
for (i=0; i<NRA; i++){
printf("\n"); for (j=0; j<NCA; j++)
printf("%6.2f ", a[i][j]);}
for (i=0; i<NRA; i++){
printf("\n"); for (j=0; j<NCB; j++)
printf("%6.2f ", b[i][j]);}
/* Send matrix data to the worker tasks */
averow = NRA/numworkers;extra = NRA%numworkers;offset = 0;mtype = FROM_MASTER;for (dest=1; dest<=numworkers; dest++){
rows = (dest <= extra) ? averow+1 : averow;
printf("Sending %d rows to task %d offset=%d\n",rows,dest,offset);MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype,
MPI_COMM_WORLD);MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
offset = offset + rows;}
Example: Matrix-Vector Multiplication
Load Balancing : Dividing the Matrix A based on the number of processors
MASTER sends Matrix A to workers :PROC[0] :: 0.00 1.00 2.00 3.00 PROC[1] :: 1.00 2.00 3.00 4.00 PROC[2] :: 2.00 3.00 4.00 5.00 PROC[3] :: 3.00 4.00 5.00 6.00
MASTER Sends Vector B to Workers:PROC[0] :: 1.00 2.00 3.00 4.00 PROC[1] :: 1.00 2.00 3.00 4.00 PROC[2] :: 1.00 2.00 3.00 4.00 PROC[3] :: 1.00 2.00 3.00 4.00
CSC 7600 Lecture 8 : MPI2
Spring 2011
Example: Matrix-Vector Multiplication
51
/* Receive results from worker tasks */mtype = FROM_WORKER;for (i=1; i<=numworkers; i++){
source = i;
MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype,
MPI_COMM_WORLD, &status);printf("Received results from task %d\n",source);
}
/* Print results */printf("******************************************************\n");printf("Result Matrix:\n");for (i=0; i<NRA; i++){
printf("\n"); for (j=0; j<NCB; j++)
printf("%6.2f ", c[i][j]);}printf("\n******************************************************\n");printf ("Done.\n");
}
The Master process gathers the results and populates the result matrix in the correct order (easily done in this case because matrix index I is used to indicate position in result array)
CSC 7600 Lecture 8 : MPI2
Spring 2011
Example: Matrix-Vector Multiplication
52
/**************************** worker task ************************************/if (taskid > MASTER){
mtype = FROM_MASTER;
MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
for (k=0; k<NCB; k++)for (i=0; i<rows; i++){
c[i][k] = 0.0;for (j=0; j<NCA; j++)
c[i][k] = c[i][k] + a[i][j] * b[j][k];}
mtype = FROM_WORKER;MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
}MPI_Finalize();
}
Worker Processes receive workloadProc[1] A : 1.00 2.00 3.00 4.00Proc[1] B : 1.00 2.00 3.00 4.00
Calculate ResultProc[1] C : 1.00 + 4.00 + 9.00 + 16.00
CSC 7600 Lecture 8 : MPI2
Spring 2011
Example: Matrix-Vector Multiplication
(Results)
53
CSC 7600 Lecture 8 : MPI2
Spring 201154
Topics
• MPI Collective Calls: Synchronization Primitives
• MPI Collective Calls: Communication Primitives
• MPI Collective Calls: Reduction Primitives
• Derived Datatypes: Introduction
• Derived Datatypes: Contiguous
• Derived Datatypes: Vector
• Derived Datatypes: Indexed
• Derived Datatypes: Struct
• Matrix-Vector multiplication : A Case Study
• MPI Profiling calls
• Additional Topics
• Summary Materials for Test
CSC 7600 Lecture 8 : MPI2
Spring 201155
MPI Profiling : MPI_Wtime
Function: MPI_Wtime()
double MPI_Wtime()
Description:
Returns the time in seconds elapsed on the calling processor. The resolution of the time scale
is given by MPI_Wtick(). When the attribute MPI_WTIME_IS_GLOBAL is defined and set to true, the
value of MPI_Wtime is synchronized across all processes in MPI_COMM_WORLD
double time0;
...
time0 = MPI_Wtime();
...
printf("Hello From Worker #%d %lf \n", rank, (MPI_Wtime() - time0));
http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Wtime.html
CSC 7600 Lecture 8 : MPI2
Spring 2011
Timing Example: MPI_Wtime
56
#include <stdio.h>
#include "mpi.h"

main(int argc, char **argv)
{
    int size, rank;
    double time0, time1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    time0 = MPI_Wtime();
    if (rank == 0)
    {
        printf(" Hello From Proc0 Time = %lf \n", (MPI_Wtime() - time0));
    }
    else
    {
        printf("Hello From Worker #%d %lf \n", rank, (MPI_Wtime() - time0));
    }
    MPI_Finalize();
}
CSC 7600 Lecture 8 : MPI2
Spring 201157
Topics
• MPI Collective Calls: Synchronization Primitives
• MPI Collective Calls: Communication Primitives
• MPI Collective Calls: Reduction Primitives
• Derived Datatypes: Introduction
• Derived Datatypes: Contiguous
• Derived Datatypes: Vector
• Derived Datatypes: Indexed
• Derived Datatypes: Struct
• Matrix-Vector multiplication : A Case Study
• MPI Profiling calls
• Additional Topics
• Summary Materials for Test
CSC 7600 Lecture 8 : MPI2
Spring 2011
Additional Topics
• Additional topics not yet covered :
– Communication Topologies
– Profiling using Tau (to be covered with PAPI & Parallel
Algorithms)
– Profiling using PMPI (to be covered with PAPI & Parallel
Algorithms)
– Debugging MPI programs
58
CSC 7600 Lecture 8 : MPI2
Spring 201159
Topics
• MPI Collective Calls: Synchronization Primitives
• MPI Collective Calls: Communication Primitives
• MPI Collective Calls: Reduction Primitives
• Derived Datatypes: Introduction
• Derived Datatypes: Contiguous
• Derived Datatypes: Vector
• Derived Datatypes: Indexed
• Derived Datatypes: Struct
• Matrix-Vector multiplication : A Case Study
• MPI Profiling calls
• Additional Topics
• Summary Materials for Test
CSC 7600 Lecture 8 : MPI2
Spring 2011
Summary : Material for the Test
• Collective calls
– Barrier (6) , Broadcast (9), Scatter(10), Gather(11), Allgather(12)
– Reduce(14), Binary operations (15), All Reduce (16)
• Derived Datatypes (25,26,27)
– Contiguous (29,30,31)
– Vector (33,34,35)
60
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
SMP NODES
Prof. Thomas Sterling Department of Computer Science Louisiana State University February 15, 2011
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
2
Topics
• Introduction • SMP Context • Performance: Amdahl’s Law • SMP System structure • Processor core • Memory System • Chip set • South Bridge – I/O • Performance Issues • Summary – Material for the Test
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
3
Topics
• Introduction • SMP Context • Performance: Amdahl’s Law • SMP System structure • Processor core • Memory System • Chip set • South Bridge – I/O • Performance Issues • Summary – Material for the Test
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
4
Opening Remarks
• This week is about supercomputer architecture – Last time: end of cooperative computing – Today: capability computing with modern microprocessor and
multicore SMP node
• As we’ve seen, there is a diversity of HPC system types • Most common systems are either SMPs or are
ensembles of SMP nodes • “SMP” stands for: “Symmetric Multi-Processor” • System performance is strongly influenced by SMP node
performance • Understanding structure, functionality, and operation of
SMP nodes will allow effective programming
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
5
The take-away message
• Primary structure and elements that make up an SMP node
• Primary structure and elements that make up the modern multicore microprocessor component
• The factors that determine microprocessor delivered performance
• The factors that determine overall SMP sustained performance
• Amdahl’s law and how to use it • Calculating cpi • Reference: J. Hennessy & D. Patterson, “Computer Architecture
A Quantitative Approach” 3rd Edition, Morgan Kaufmann, 2003
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
6
Topics
• Introduction • SMP Context • Performance: Amdahl’s Law • SMP System structure • Processor core • Memory System • Chip set • South Bridge – I/O • Performance Issues • Summary – Material for the Test
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
7
SMP Context
• A standalone system – Incorporates everything needed for
• Processors • Memory • External I/O channels • Local disk storage • User interface
– Enterprise server and institutional computing market • Exploits economy of scale to enhance performance to cost • Substantial performance
– Target for ISVs (Independent Software Vendors) • Shared memory multiple thread programming platform
– Easier to program than distributed memory machines – Enough parallelism to fully employ system threads (processor cores)
• Building block for ensemble supercomputers – Commodity clusters – MPPs
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
8
Topics
• Introduction • SMP Context • Performance: Amdahl’s Law • SMP System structure • Processor core • Memory System • Chip set • South Bridge – I/O • Performance Issues • Summary – Material for the Test
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
9
Performance: Amdahl’s Law
Baton Rouge to Houston
• from my house on East Lakeshore Dr.
• to the downtown Hyatt Regency
• distance of 271 miles
• in-air flight time: 1 hour
• door-to-door time to drive: 4.5 hours
• cruise speed of Boeing 737: 600 mph
• cruise speed of BMW 528: 60 mph
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
10
Amdahl’s Law: drive or fly? • Peak performance gain: 10X
– BMW cruise approx. 60 MPH – Boeing 737 cruise approx. 600 MPH
• Time door to door – BMW
• Google estimates 4 hours 30 minutes – Boeing 737
• Time to drive to BTR from my house = 15 minutes • Wait time at BTR = 1 hour • Taxi time at BTR = 5 minutes • Continental estimates BTR to IAH 1 hour • Taxi time at IAH = 15 minutes (assuming gate available) • Time to get bags at IAH = 25 minutes • Time to get rental car = 15 minutes • Time to drive to Hyatt Regency from IAH = 45 minutes • Total time = 4.0 hours
• Sustained performance gain: 1.125X
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
11
Amdahl’s Law
Definitions:
  T_O ≡ time for the non-accelerated computation
  T_F ≡ time of the portion of the computation that can be accelerated
  T_A ≡ time for the accelerated computation
  g   ≡ peak performance gain for the accelerated portion of the computation
  f   ≡ fraction of the non-accelerated computation to be accelerated
  S   ≡ speedup of the computation with acceleration applied

  T_F = f × T_O
  T_A = (1 − f) × T_O + f × (T_O / g)
  S   = T_O / T_A = 1 / ( (1 − f) + f/g )

[Timeline diagram: the original run of length T_O contains an accelerable portion T_F; in the accelerated run of length T_A that portion shrinks to T_F/g]
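A small worked example (added here, not on the original slide): if 90% of a computation can be accelerated by a peak factor of g = 10, then
  S = 1 / ( (1 − 0.9) + 0.9/10 ) = 1 / 0.19 ≈ 5.3,
far below the peak gain of 10, because the unaccelerated 10% dominates.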
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
12
Amdahl’s Law and Parallel Computers
• Amdahl’s Law (FracX: original % to be speed up) Speedup = 1 / [(FracX/SpeedupX) + (1-FracX)]
• A portion is sequential => limits parallel speedup – Speedup <= 1/ (1-FracX)
• Ex. What fraction sequential to get 80X speedup from 100 processors? Assume either 1 processor or 100 fully used
80 = 1 / [ (FracX/100) + (1 - FracX) ]
0.8*FracX + 80*(1 - FracX) = 80 - 79.2*FracX = 1
FracX = (80 - 1)/79.2 = 0.9975
• Only 0.25% sequential!
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
13
Amdahl’s Law with Overhead
Definitions (in addition to those on the previous slide):
  v   ≡ overhead of each accelerated work segment
  V   ≡ total overhead for the accelerated work, V = Σ v_i = n × v
  t_F ≡ time of one accelerable work segment, with T_F = Σ t_Fi

  T_A = (1 − f) × T_O + f × (T_O / g) + n × v
  S   = T_O / T_A = 1 / ( (1 − f) + f/g + (n × v)/T_O )

[Timeline diagram: the original run T_O contains n accelerable segments of length t_F; in the accelerated run T_A each segment becomes v + t_F/g]
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
14
Topics
• Introduction • SMP Context • Performance: Amdahl’s Law • SMP System structure • Processor core • Memory System • Chip set • South Bridge – I/O • Performance Issues • Summary – Material for the Test
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
15
SMP Node Diagram
[Diagram of an SMP node: microprocessors (MP), each with private L1 and L2 caches, sharing L3 caches; memory banks M1, M2, ..., Mn behind a memory controller; storage devices (S); network interface cards (NIC) on Ethernet and PCI-e; USB peripherals; and a JTAG maintenance interface]
Legend: MP : MicroProcessor; L1, L2, L3 : Caches; M1, M2, ... : Memory Banks; S : Storage; NIC : Network Interface Card
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
16
SMP System Examples
Vendor & name                  | Processor               | Number of cores | Cores per proc. | Memory | Chipset                      | PCI slots
IBM eServer p5 595             | IBM Power5 1.9 GHz       | 64              | 2               | 2 TB   | Proprietary GX+, RIO-2       | ≤240 PCI-X (20 standard)
Microway QuadPuter-8           | AMD Opteron 2.6 GHz      | 16              | 2               | 128 GB | Nvidia nForce Pro 2200+2050  | 6 PCIe
Ion M40                        | Intel Itanium 2 1.6 GHz  | 8               | 2               | 128 GB | Hitachi CF-3e                | 4 PCIe, 2 PCI-X
Intel Server System SR870BN4   | Intel Itanium 2 1.6 GHz  | 8               | 2               | 64 GB  | Intel E8870                  | 8 PCI-X
HP Proliant ML570 G3           | Intel Xeon 7040 3 GHz    | 8               | 2               | 64 GB  | Intel 8500                   | 4 PCIe, 6 PCI-X
Dell PowerEdge 2950            | Intel Xeon 5300 2.66 GHz | 8               | 4               | 32 GB  | Intel 5000X                  | 3 PCIe
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
17
Sample SMP Systems
DELL PowerEdge
HP Proliant
Intel Server System
IBM p5 595
Microway Quadputer
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
18
HyperTransport-based SMP System
Source: http://www.devx.com/amd/Article/17437
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
19
Comparison of Opteron and Xeon SMP Systems
Source: http://www.devx.com/amd/Article/17437
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
20
Multi-Chip Module (MCM) Component of IBM Power5 Node
20
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
21
Major Elements of an SMP Node • Processor chip • DRAM main memory cards • Motherboard chip set • On-board memory network
– North bridge • On-board I/O network
– South bridge • PCI industry standard interfaces
– PCI, PCI-X, PCI-express • System Area Network controllers
– e.g. Ethernet, Myrinet, Infiniband, Quadrics, Federation Switch • System Management network
– Usually Ethernet – JTAG for low level maintenance
• Internal disk and disk controller • Peripheral interfaces
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
22
Topics
• Introduction • SMP Context • Performance: Amdahl’s Law • SMP System structure • Processor core • Memory System • Chip set • South Bridge – I/O • Performance Issues • Summary – Material for the Test
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
23
[Die photo: Itanium™ Processor Silicon (Copyright: Intel at Hotchips ’00). The core processor die contains the FPU, IA-32 control, instruction fetch & decode, caches, TLB, integer units, IA-64 control and bus logic, plus 4 x 1MB of L3 cache.]
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
24
Multicore Microprocessor Component Elements
• Multiple processor cores – One or more processors
• L1 caches – Instruction cache – Data cache
• L2 cache – Joint instruction/data cache – Dedicated to individual core processor
• L3 cache – Not all systems – Shared among multiple cores – Often off die but in same package
• Memory interface – Address translation and management (sometimes) – North bridge
• I/O interface – South bridge
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
25
Comparison of Current Microprocessors
Processor | Clock rate | Caches (per core) | ILP (each core) | Cores per chip | Process & die size | Power | Linpack TPP (one core)
AMD Opteron | 2.6 GHz | L1I: 64KB, L1D: 64KB, L2: 1MB | 2 FPops/cycle, 3 Iops/cycle, 2* LS/cycle | 2 | 90nm, 220mm2 | 95W | 3.89 Gflops
IBM Power5+ | 2.2 GHz | L1I: 64KB, L1D: 32KB, L2: 1.875MB, L3: 18MB | 4 FPops/cycle, 2 Iops/cycle, 2 LS/cycle | 2 | 90nm, 243mm2 | 180W (est.) | 8.33 Gflops
Intel Itanium 2 (9000 series) | 1.6 GHz | L1I: 16KB, L1D: 16KB, L2I: 1MB, L2D: 256KB, L3: 3MB or more | 4 FPops/cycle, 4 Iops/cycle, 2 LS/cycle | 2 | 90nm, 596mm2 | 104W | 5.95 Gflops
Intel Xeon Woodcrest | 3 GHz | L1I: 32KB, L1D: 32KB, L2: 2MB | 4 FPops/cycle, 3 Iops/cycle, 1L+1S/cycle | 2 | 65nm, 144mm2 | 80W | 6.54 Gflops
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
26
Processor Core Micro Architecture
• Execution Pipeline – Stages of functionality to process issued instructions – Hazards are conflicts with continued execution – Forwarding supports closely associated operations exhibiting
precedence constraints • Out of Order Execution
– Uses reservation stations – hides some core latencies and provides fine-grain asynchronous
operation supporting concurrency • Branch Prediction
– Permits computation to proceed at a conditional branch point prior to resolving predicate value
– Overlaps follow-on computation with predicate resolution – Requires roll-back or equivalent to correct false guesses – Sometimes follows both paths, and several deep
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
27
Topics
• Introduction • SMP Context • Performance: Amdahl’s Law • SMP System structure • Processor core • Memory System • Chip set • South Bridge – I/O • Performance Issues • Summary – Material for the Test
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
28
Recap: Who Cares About the Memory Hierarchy?
[Chart: processor vs. DRAM performance, 1980-2000, log scale. CPU performance ("Moore's Law") improves roughly 60%/yr (2X every 1.5 years) while DRAM improves roughly 9%/yr (2X every 10 years), so the processor-DRAM memory gap (latency) grows about 50% per year. Copyright 2001, UCB, David Patterson]
CSC 7600 Lecture 9 : SMP Nodes Spring 2011 29
What is a cache? • Small, fast storage used to improve average access time to slow
memory. • Exploits spatial and temporal locality • In computer architecture, almost everything is a cache!
– Registers: a cache on variables – First-level cache: a cache on second-level cache – Second-level cache: a cache on memory – Memory: a cache on disk (virtual memory) – TLB :a cache on page table – Branch-prediction: a cache on prediction information
[Diagram: memory hierarchy pyramid - Proc/Regs at the top, then L1-Cache, L2-Cache, Memory, and Disk/Tape at the bottom; levels become bigger toward the bottom and faster toward the top. Copyright 2001, UCB, David Patterson]
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
30
Levels of the Memory Hierarchy
Capacity, access time, cost, and staging unit per level:
• CPU Registers: 100s of bytes, < 0.5 ns (typically 1 CPU cycle); staging unit: instruction operands, 1-8 bytes, managed by program/compiler
• L1 cache: 10s-100s of KBytes, 1-5 ns, $10/MByte; staging unit: blocks, 8-128 bytes, managed by the cache controller
• Main memory: a few GBytes, 50-150 ns, $0.02/MByte; staging unit: pages, 512 bytes-4 KBytes, managed by the OS
• Disk: 100s-1000s of GBytes, 500,000-1,500,000 ns, $0.25/GByte; staging unit: files, MBytes, managed by the user/operator
• Tape: "infinite" capacity, access time of seconds to minutes, $0.0014/MByte
Upper levels are faster; lower levels are larger.
Copyright 2001, UCB, David Patterson
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
31
Cache Measures
• Hit rate: fraction found in that level – So high that usually talk about Miss rate
• Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
• Miss penalty: time to replace a block from lower level, including time to replace in CPU
– access time: time to lower level = f(latency to lower level)
– transfer time: time to transfer block =f(BW between upper & lower levels)
Copyright 2001, UCB, David Patterson
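A one-line C helper (illustrative only, not from the slides) that evaluates the average memory-access time formula above; the name amat is an assumption.

/* Average memory access time = hit time + miss rate * miss penalty
   (all arguments in the same unit, ns or cycles). */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
/* Example: amat(1.0, 0.05, 100.0) -> 6.0 ns */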
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
32
Memory Hierarchy: Terminology • Hit: data appears in some block in the upper level (example:
Block X) – Hit Rate: the fraction of memory accesses found in the upper level – Hit Time: Time to access the upper level which consists of
RAM access time + Time to determine hit/miss • Miss: data needs to be retrieved from a block in the lower level
(Block Y) – Miss Rate = 1 - (Hit Rate) – Miss Penalty: Time to replace a block in the upper level +
Time to deliver the block to the processor • Hit Time << Miss Penalty (500 instructions on 21264!)
Lower Level Memory Upper Level
Memory To Processor
From Processor Blk X
Blk Y
Copyright 2001, UCB, David Patterson
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
Cache Performance
33
T = I_count × CPI × T_cycle
CPI = (I_ALU / I_count) × CPI_ALU + (I_MEM / I_count) × CPI_MEM

T = total execution time
T_cycle = time for a single processor cycle
I_count = total number of instructions
I_ALU = number of ALU instructions (e.g. register - register)
I_MEM = number of memory access instructions (e.g. load, store)
CPI = average cycles per instruction
CPI_ALU = average cycles per ALU instruction
CPI_MEM = average cycles per memory instruction
r_miss = cache miss rate
r_hit = cache hit rate
CPI_MEM-MISS = cycles per cache miss
CPI_MEM-HIT = cycles per cache hit
M_ALU = instruction mix for ALU instructions
M_MEM = instruction mix for memory access instructions
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
Cache Performance
34
Instruction mix:
M_ALU = I_ALU / I_count
M_MEM = I_MEM / I_count
M_ALU + M_MEM = 1

CPI = M_ALU × CPI_ALU + M_MEM × CPI_MEM
T = I_count × (M_ALU × CPI_ALU + M_MEM × CPI_MEM) × T_cycle

(Symbols as defined on the previous slide.)
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
Cache Performance
35
CPI_MEM = CPI_MEM-HIT + r_miss × CPI_MEM-MISS
T = I_count × [M_ALU × CPI_ALU + M_MEM × (CPI_MEM-HIT + r_miss × CPI_MEM-MISS)] × T_cycle

(Symbols as defined on the previous slides.)
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
Cache Performance: Example
36
Given:
CPI_MEM-HIT = 1, CPI_MEM-MISS = 100
CPI_ALU = 1, T_cycle = 0.5 ns
I_count = 10^11, I_MEM = 2 × 10^10

Instruction mix:
I_ALU = I_count - I_MEM = 10^11 - 2 × 10^10 = 8 × 10^10
M_ALU = I_ALU / I_count = 8 × 10^10 / 10^11 = 0.8
M_MEM = I_MEM / I_count = 2 × 10^10 / 10^11 = 0.2

System A: r_hit,A = 0.9
CPI_MEM,A = CPI_MEM-HIT + r_miss,A × CPI_MEM-MISS = 1 + (1 - 0.9) × 100 = 11
T_A = 10^11 × ((0.8 × 1) + (0.2 × 11)) × 5 × 10^-10 = 150 sec

System B: r_hit,B = 0.5
CPI_MEM,B = CPI_MEM-HIT + r_miss,B × CPI_MEM-MISS = 1 + (1 - 0.5) × 100 = 51
T_B = 10^11 × ((0.8 × 1) + (0.2 × 51)) × 5 × 10^-10 = 550 sec
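A small C sketch (mine, not from the slides) that plugs the example's numbers into the execution-time model above; the function name exec_time is an assumption.

#include <stdio.h>

/* Execution time under the CPI model above:
   T = Icount * (Malu*CPIalu + Mmem*(CPImem_hit + rmiss*CPImem_miss)) * Tcycle */
static double exec_time(double icount, double m_alu, double cpi_alu,
                        double m_mem, double cpi_hit, double cpi_miss,
                        double r_miss, double t_cycle) {
    double cpi_mem = cpi_hit + r_miss * cpi_miss;
    return icount * (m_alu * cpi_alu + m_mem * cpi_mem) * t_cycle;
}

int main(void) {
    /* Example values: Icount=1e11, Malu=0.8, Mmem=0.2, CPIalu=1,
       CPImem-hit=1, CPImem-miss=100, Tcycle=0.5 ns. */
    double ta = exec_time(1e11, 0.8, 1.0, 0.2, 1.0, 100.0, 0.1, 0.5e-9); /* rmiss=0.1 -> 150 s */
    double tb = exec_time(1e11, 0.8, 1.0, 0.2, 1.0, 100.0, 0.5, 0.5e-9); /* rmiss=0.5 -> 550 s */
    printf("T_A = %.0f s, T_B = %.0f s\n", ta, tb);
    return 0;
}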
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
37
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
38
Topics
• Introduction • SMP Context • Performance: Amdahl’s Law • SMP System structure • Processor core • Memory System • Chip set • South Bridge – I/O • Performance Issues • Summary – Material for the Test
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
39
Motherboard Chipset
• Provides core functionality of motherboard • Embeds low-level protocols to facilitate efficient communication between
local components of computer system • Controls the flow of data between the CPU, system memory, on-board
peripheral devices, expansion interfaces and I/O subsystem • Also responsible for power management features, retention of non-volatile
configuration data and real-time measurement • Typically consists of:
– Northbridge (Memory Controller Hub, MCH), managing traffic between the processor, RAM, GPU, southbridge and optionally PCI Express slots
– Southbridge (I/O Controller Hub, ICH), coordinating slower set of devices, including traditional PCI bus, ISA bus, SMBus, IDE (ATA), DMA and interrupt controllers, real-time clock, BIOS memory, ACPI power management, LPC bridge (providing fan control, floppy disk, keyboard, mouse, MIDI interfaces, etc.), and optionally Ethernet, USB, IEEE1394, audio codecs and RAID interface
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
40
Major Chipset Vendors
• Intel – http://developer.intel.com/products/chipsets/index.htm
• Via – http://www.via.com.tw/en/products/chipsets
• SiS – http://www.sis.com/products/product_000001.htm
• AMD/ATI – http://ati.amd.com/products/integrated.html
• Nvidia – http://www.nvidia.com/page/mobo.html
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
41
Chipset Features Overview
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
42
Motherboard
• Also referred to as main board, system board, backplane • Provides mechanical and electrical support for pluggable
components of a computer system • Constitutes the central circuitry of a computer,
distributing power and clock signals to target devices, and implementing communication backplane for data exchanges between them
• Defines expansion possibilities of a computer system through slots accommodating special purpose cards, memory modules, processor(s) and I/O ports
• Available in many form factors and with various capabilities to match particular system needs, housing capacity and cost
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
43
Motherboard Form Factors
• Refer to standardized motherboard sizes • The most popular form factor used today is ATX, which evolved
from the now obsolete AT (Advanced Technology) format • Examples of other common form factors:
– MicroATX, miniaturized version of ATX – WTX, large form factor designated for use in high power
workstations/servers featuring multiple processors – Mini-ITX, designed for use in thin clients – PC/104 and ETX, used in embedded systems and single
board computers – BTX (Balanced Technology Extended), introduced by Intel as
a possible successor to ATX
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
44
Motherboard Manufacturers
• Abit • Albatron • Aopen • ASUS • Biostar • DFI • ECS • Epox • FIC • Foxconn • Gigabyte
• IBM • Intel • Jetway • MSI • Shuttle • Soyo • SuperMicro • Tyan • VIA
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
45
Source: http://www.motherboards.org
Populated CPU Socket
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
46
Source: http://www.motherboards.org
DIMM Memory Sockets
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
47
Motherboard on Arete
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
48
Source: http://www.tyan.com
SuperMike Motherboard: Tyan Thunder i7500 (S720)
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
49
Topics
• Introduction • SMP Context • Performance: Amdahl’s Law • SMP System structure • Processor core • Memory System • Chip set • South Bridge – I/O • Performance Issues • Summary – Material for the Test
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
50
PCI enhanced systems
http://arstechnica.com/articles/paedia/hardware/pcie.ars/1
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
51
PCI-express
Lane width
Clock speed
Throughput (duplex, bits)
Throughput (duplex, bytes)
Initial expected uses
x1 2.5 GHz 5 Gbps 400 MBps Slots, Gigabit Ethernet
x2 2.5 GHz 10 Gbps 800 MBps
x4 2.5 GHz 20 Gbps 1.6 GBps Slots, 10 Gigabit Ethernet, SCSI, SAS
x8 2.5 GHz 40 Gbps 3.2 GBps
x16 2.5 GHz 80 Gbps 6.4 GBps Graphics adapters
http://www.redbooks.ibm.com/abstracts/tips0456.html
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
52
PCI-X Bus Width Clock Speed Features Bandwidth
PCI-X 66 64 Bits 66 MHz Hot Plugging, 3.3 V 533 MB/s
PCI-X 133 64 Bits 133 MHz Hot Plugging, 3.3 V 1.06 GB/s
PCI-X 266
64 Bits, optional 16 Bits only
133 MHz Double Data Rate
Hot Plugging, 3.3 & 1.5 V, ECC supported 2.13 GB/s
PCI-X 533
64 Bits, optional 16 Bits only
133 MHz Quad Data Rate
Hot Plugging, 3.3 & 1.5 V, ECC supported 4.26 GB/s
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
53
Bandwidth Comparisons CONNECTION BITS BYTES
PCI 32-bit/33 MHz 1.06666 Gbit/s 133.33 MB/s
PCI 64-bit/33 MHz 2.13333 Gbit/s 266.66 MB/s
PCI 32-bit/66 MHz 2.13333 Gbit/s 266.66 MB/s
PCI 64-bit/66 MHz 4.26666 Gbit/s 533.33 MB/s
PCI 64-bit/100 MHz 6.39999 Gbit/s 799.99 MB/s
PCI Express (x1 link)[6] 2.5 Gbit/s 250 MB/s
PCI Express (x4 link)[6] 10 Gbit/s 1 GB/s
PCI Express (x8 link)[6] 20 Gbit/s 2 GB/s PCI Express (x16 link)[6] 40 Gbit/s 4 GB/s
PCI Express 2.0 (x32 link)[6] 80 Gbit/s 8 GB/s
PCI-X DDR 16-bit 4.26666 Gbit/s 533.33 MB/s
PCI-X 133 8.53333 Gbit/s 1.06666 GB/s
PCI-X QDR 16-bit 8.53333 Gbit/s 1.06666 GB/s
PCI-X DDR 17.066 Gbit/s 2.133 GB/s
PCI-X QDR 34.133 Gbit/s 4.266 GB/s
AGP 8x 17.066 Gbit/s 2.133 GB/s
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
54
HyperTransport : Context
• Northbridge-Southbridge device connection facilitates communication over fast processor bus between system memory, graphics adaptor, CPU
• Southbridge operates several I/O interfaces and communicates with the Northbridge over another proprietary connection
• This approach is potentially limited by the emerging bandwidth demands over inadequate I/O buses
• HyperTransport is one of the many technologies aimed at improving I/O.
• High data rates are achieved by using enhanced, low-swing, 1.2 V Low Voltage Differential Signaling (LVDS) that employs fewer pins and wires consequently reducing cost and power requirements.
• HyperTransport also helps in communication between multiple AMD Opteron CPUs
http://www.amd.com/us-en/Processors/ComputingSolutions/0,,30_288_13265_13295%5E13340,00.html
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
55
Hyper-Transport (continued) • Point-to-point parallel topology uses 2
unidirectional links (one each for upstream and downstream)
• HyperTransport technology chunks data into packets to reduce overhead and improve efficiency of transfers.
• Each HyperTransport technology link also contains an 8-bit data path that allows insertion of a control packet in the middle of a long data packet, thus reducing latency.
• In Summary : “HyperTransport™ technology delivers the raw throughput and low latency necessary for chip-to-chip communication. It increases I/O bandwidth, cuts down the number of different system buses, reduces power consumption, provides a flexible, modular bridge architecture, and ensures compatibility with PCI. “
http://www.amd.com/us-en/Processors/ComputingSolutions/0,,30_288_13265_13295%5E13340,00.html
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
56
Topics
• Introduction • SMP Context • Performance: Amdahl’s Law • SMP System structure • Processor core • Memory System • Chip set • South Bridge – I/O • Performance Issues • Summary – Material for the Test
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
57
Performance Issues
• Cache behavior – Hit/miss rate – Replacement strategies
• Prefetching • Clock rate • ILP • Branch prediction • Memory
– Access time – Bandwidth
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
58
Topics
• Introduction • SMP Context • Performance: Amdahl’s Law • SMP System structure • Processor core • Memory System • Chip set • South Bridge – I/O • Performance Issues • Summary – Material for the Test
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
59
Summary – Material for the Test
• Please make sure that you have addressed all points outlined on slide 5
• Understand content on slide 7 • Understand concepts, equations, problems on
slides 11, 12, 13 • Understand content on 21, 24, 26, 29 • Understand concepts on slides 32,33,34,35,36 • Understand content on slides 39, 57
• Required reading material :
http://arstechnica.com/articles/paedia/hardware/pcie.ars/1
CSC 7600 Lecture 9 : SMP Nodes Spring 2011
60
CSC 7600 Lecture 11 : Pthreads Spring 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
Pthreads
Prof. Thomas Sterling Department of Computer Science Louisiana State University February 22, 2011
CSC 7600 Lecture 11 : Pthreads Spring 2011
2
Topics
• Introduction • Performance: CPI and memory behavior • Overview of threaded execution model • Programming with threads: basic concepts • Shared memory consistency models • Pitfalls of multithreaded programming • Thread implementations: approaches and issues • Pthreads: concepts and API • Summary
CSC 7600 Lecture 11 : Pthreads Spring 2011
3
Topics
• Introduction • Performance: CPI and memory behavior • Overview of threaded execution model • Programming with threads: basic concepts • Shared memory consistency models • Pitfalls of multithreaded programming • Thread implementations: approaches and issues • Pthreads: concepts and API • Summary
CSC 7600 Lecture 11 : Pthreads Spring 2011
Opening Remarks
• We now have a good picture of supercomputer architecture – including SMP structures
• which are the building blocks of most HPC systems on the Top-500 List
• We were introduced to the first two programming methods for exploiting parallelism – Capacity Computing - Condor – Co-operative Computing - MPI
• Now we explore a 3rd programming model: multithreaded computing on shared memory systems – This time: general principles and POSIX Pthreads – Next time: OpenMP
4
CSC 7600 Lecture 11 : Pthreads Spring 2011
What you’ll Need to Know
• Modeling time to execution with CPI • Multi-thread programming and execution concepts
– Parallelism with multiple threads – Synchronization – Memory consistency models
• Basic Pthread commands • Dangers
– Race conditions – Deadlock
5
CSC 7600 Lecture 11 : Pthreads Spring 2011
6
Topics
• Introduction • Performance: CPI and memory behavior • Overview of threaded execution model • Programming with threads: basic concepts • Shared memory consistency models • Pitfalls of multithreaded programming • Thread implementations: approaches and issues • Pthreads: concepts and API • Summary
CSC 7600 Lecture 11 : Pthreads Spring 2011
7
CPI
cpi ≡ cycles per instruction
cpi_R ≡ cpi for register operations
cpi_M ≡ cpi for memory operations
cpi_M-hit ≡ cpi for memory operations with cache hit
cpi_M-miss ≡ cpi for memory operations with cache miss (miss penalty)
I# ≡ number of executed instructions
I#_R ≡ number of executed register instructions
I#_M ≡ number of executed memory instructions
t_c ≡ cycle time
T ≡ execution time
T_R ≡ execution time for register instructions
T_M ≡ execution time for memory instructions
r_miss ≡ cache miss rate
CSC 7600 Lecture 11 : Pthreads Spring 2011
8
CPI (continued)
T = I# × cpi × t_c
m_R ≡ I#_R / I#
m_M ≡ I#_M / I#, where m_R + m_M = 1.0
cpi = m_R × cpi_R + m_M × cpi_M
cpi_M = (1 - r_miss) × cpi_M-hit + r_miss × cpi_M-miss
T = I# × [m_R × cpi_R + m_M × ((1 - r_miss) × cpi_M-hit + r_miss × cpi_M-miss)] × t_c
CSC 7600 Lecture 11 : Pthreads Spring 2011
An Example
Robert hates parallel computing and runs all of his jobs on a single processor core on his Acme computer. His current application plays solitaire because he is too lazy to flip the cards himself. The machine he is running on has a 2 GHz clock. For this problem the basic register operations make up only 75% of the instruction mix but deliver one and a half instructions per cycle while the load and store operations yield one per cycle. But his cache hit rate is only 80% and the average penalty for not finding data in the L1 cache is 120 nanoseconds. A counter on the Acme processor tells Robert that it takes approximately 16 billion instruction executions to run his short program. How long does it take to execute Robert's application?
9
CSC 7600 Lecture 11 : Pthreads Spring 2011
And the answer is …
I# = 16,000,000,000 = 1.6 × 10^10
clock rate = 2.0 GHz ⇒ t_c = 0.5 nanoseconds
r_hit = 0.8 ⇒ r_miss = 1 - r_hit = 0.2
cpi_R = 2/3 (1.5 register instructions per cycle), cpi_M-hit = 1
cpi_M-miss = 2 cycles/ns × 120 ns = 240 cycles
m_R = 0.75 ⇒ m_M = 0.25

T = 1.6 × 10^10 × [0.75 × (2/3) + 0.25 × ((0.8 × 1) + (0.2 × 240))] × 0.5 × 10^-9
T = 1.6 × 10^10 × [0.5 + 0.25 × 48.8] × 0.5 × 10^-9
T = 1.6 × 10^10 × 12.7 × 0.5 × 10^-9 = 101.6 seconds
10
CSC 7600 Lecture 11 : Pthreads Spring 2011
11
Topics
• Introduction • Performance: CPI and memory behavior • Overview of threaded execution model • Programming with threads: basic concepts • Shared memory consistency models • Pitfalls of multithreaded programming • Thread implementations: approaches and issues • Pthreads: concepts and API • Summary
CSC 7600 Lecture 11 : Pthreads Spring 2011
Address Space
Thread 1
Address Space
global data
UNIX Processes vs. Multithreaded Programs
12
exec. state
stack
PID
text
Address Space
global data
exec. state
stack
PID1
text
Copy of PID1’s Address Space
global data
exec. state
stack
PID2
text
fork()
shared data
exec. state
stack
PID
text
private data
Thread 2
exec. state
stack
private data
thread create
Thread m
Standard UNIX process
(single-threaded) New process spawned via fork() Multithreaded Application
CSC 7600 Lecture 11 : Pthreads Spring 2011
13
Anatomy of a Thread
Thread (or, more precisely: thread of execution) is typically described as a lightweight process. There are, however, significant differences in the way standard processes and threads are created, how they interact and access resources. Many aspects of these are implementation dependent.
Private state of a thread includes: • Execution state (instruction pointer, registers) • Stack • Private variables (typically allocated on thread's stack)
Threads share access to global data in the application's address space.
CSC 7600 Lecture 11 : Pthreads Spring 2011
14
Topics
• Introduction • Performance: CPI and memory behavior • Overview of threaded execution model • Programming with threads: basic concepts • Shared memory consistency models • Pitfalls of multithreaded programming • Thread implementations: approaches and issues • Pthreads: concepts and API • Summary
CSC 7600 Lecture 11 : Pthreads Spring 2011
15
Race Conditions
Example: consider the following piece of pseudo-code to be executed concurrently by threads T1 and T2 (the initial value of memory location A is x)
A→R: read memory location A into register R R++: increment register R A←R: write R into memory location A
Scenario 1: Step 1) T1:(A→R) → T1:R=x Step 2) T1:(R++) → T1:R=x+1 Step 3) T1:(A←R) → T1:A=x+1 Step 4) T2:(A→R) → T2:R=x+1 Step 5) T2:(R++) → T2:R=x+2 Step 6) T2:(A←R) → T2:A=x+2
Scenario 2: Step 1) T1:(A→R) → T1:R=x Step 2) T2:(A→R) → T2:R=x Step 3) T1:(R++) → T1:R=x+1 Step 4) T2:(R++) → T2:R=x+1 Step 5) T1:(A←R) → T1:A=x+1 Step 6) T2:(A←R) → T2:A=x+1
Since threads are scheduled arbitrarily by an external entity, the lack of explicit synchronization may cause different outcomes.
Race condition (or race hazard) is a flaw in system or process whereby the output of the system or process is unexpectedly and critically dependent on the sequence or timing of other events.
Suggested reading: http://en.wikipedia.org/wiki/Race_condition
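A minimal Pthreads sketch (not from the slides) that exhibits the race above on a shared counter; compile with -pthread. How much of the count is lost depends on scheduling and compiler optimization, so the observed result varies from run to run.

#include <pthread.h>
#include <stdio.h>

/* Two threads increment a shared counter without synchronization.
   The load-increment-store sequence can interleave as in Scenario 2,
   so the final value is often less than 2,000,000. */
static long counter = 0;

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++)
        counter++;              /* A->R, R++, A<-R : not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}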
CSC 7600 Lecture 11 : Pthreads Spring 2011
Critical Sections
16
Critical section is a segment of code accessing a shared resource (data structure or device) that must not be concurrently accessed by more than one thread of execution.
Suggested reading: http://en.wikipedia.org/wiki/Critical_section
critical section
The implementation of a critical section must prevent any change of processor control once execution enters the critical section.
• Code on uniprocessor systems may rely on disabling interrupts and avoiding system calls leading to context switches, restoring the interrupt mask to the previous state upon exit from the critical section
• General solutions rely on synchronization mechanisms (hardware-assisted when possible), discussed on the next slides
CSC 7600 Lecture 11 : Pthreads Spring 2011
Thread Synchronization Mechanisms
• Based on atomic memory operation (require hardware support) – Spinlocks – Mutexes (and condition variables) – Semaphores – Derived constructs: monitors, rendezvous, mailboxes, etc.
• Shared memory based locking – Dekker’s algorithm
http://en.wikipedia.org/wiki/Dekker%27s_algorithm
– Peterson’s algorithm (a sketch follows below) http://en.wikipedia.org/wiki/Peterson%27s_algorithm
– Lamport’s algorithm http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm http://research.microsoft.com/users/lamport/pubs/bakery.pdf
17
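As an illustration of the shared-memory locking algorithms listed above, here is a minimal sketch of Peterson's algorithm for two threads. The function names and the use of C11 atomics are my additions; the atomics provide the sequentially consistent memory ordering the algorithm relies on.

#include <stdatomic.h>

/* Peterson's algorithm for two threads (ids 0 and 1). */
static _Atomic int flag[2];   /* flag[i] = thread i wants to enter */
static _Atomic int turn;      /* which thread yields when both want in */

static void peterson_lock(int me) {
    int other = 1 - me;
    atomic_store(&flag[me], 1);       /* announce intent */
    atomic_store(&turn, other);       /* give priority to the other thread */
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
        ;                             /* busy-wait until it is safe to enter */
}

static void peterson_unlock(int me) {
    atomic_store(&flag[me], 0);       /* leave the critical section */
}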
CSC 7600 Lecture 11 : Pthreads Spring 2011
Spinlocks
• Spinlock is the simplest kind of lock, where a thread waiting for the lock to become available repeatedly checks lock’s status
• Since the thread remains active, but doesn’t perform a useful computation, such a lock is essentially busy-waiting, and hence generally wasteful
• Spinlocks are desirable in some scenarios: – If the waiting time is short, spinlocks save the overhead and cost of context
switches, required if other threads have to be scheduled instead – In real-time system applications, spinlocks offer good and predictable
response time
• Typically use fair scheduling of threads to work correctly • Spinlock implementations require atomic hardware primitives,
such as test-and-set, fetch-and-add, compare-and-swap, etc.
18
Suggested reading: http://en.wikipedia.org/wiki/Spinlock
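A minimal sketch (not from the slides) of a spinlock built on the atomic test-and-set primitive mentioned above, using C11 atomics; the type and function names are illustrative.

#include <stdatomic.h>

typedef struct { atomic_flag locked; } spinlock_t;
static spinlock_t lock = { ATOMIC_FLAG_INIT };   /* starts unlocked */

static void spin_lock(spinlock_t *s) {
    /* test-and-set returns the previous value; spin while it was already set */
    while (atomic_flag_test_and_set_explicit(&s->locked, memory_order_acquire))
        ;   /* busy-wait: the thread stays runnable but does no useful work */
}

static void spin_unlock(spinlock_t *s) {
    atomic_flag_clear_explicit(&s->locked, memory_order_release);
}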
CSC 7600 Lecture 11 : Pthreads Spring 2011
Mutexes
• Mutex (abbreviation for mutual exclusion) is an algorithm used to prevent concurrent accesses to a common resource. The name also applies to the program object which negotiates access to that resource.
• Mutex works by atomically setting an internal flag when a thread (mutex owner) enters a critical section of the code. As long as the flag is set, no other threads are permitted to enter the section. When the mutex owner completes operations within the critical section, the flag is (atomically) cleared.
19
Suggested reading: http://en.wikipedia.org/wiki/Mutex
lock(mutex) critical section unlock(mutex)
CSC 7600 Lecture 11 : Pthreads Spring 2011
Condition Variables • Condition variables are frequently used in association with mutexes to increase
the efficiency of execution in multithreaded environments • Typical use involves a thread or threads waiting for a certain condition (based on
the values of variables inside the critical section) to occur. Note that: – The thread cannot wait inside the critical section, since no other thread would be
permitted to enter and modify the variables – The thread could monitor the values by repeatedly accessing the critical section
through its mutex; such a solution is typically very wasteful • Condition variable permits the waiting thread to temporarily release the mutex it
owns, and provide the means for other threads to communicate the state change within the critical section to the waiting thread (if such a change occurred)
20
/* waiting thread code: */ lock(mutex); /* check if you can progress */ while (condition not true) wait(cond_var); /* now you can; do your work */ ... unlock(mutex);
/* modifying thread code: */ lock(mutex); /* update critical section variables */ ... /* announce state change */ signal(cond_var); unlock(mutex);
CSC 7600 Lecture 11 : Pthreads Spring 2011
Semaphores • Semaphore is a protected variable introduced by Edsger Dijkstra (in the “THE”
operating system) and constitutes the classic method for restricting access to shared resource
• It is associated with an integer variable (semaphore’s value) and a queue of waiting threads
• Semaphore can be accessed only via the atomic P and V primitives:
• Usage: – Semaphore’s value S.v is initialized to a positive number – Semaphore’s queue S.q is initially empty – Entrance to critical section is guarded by P(S) – When exiting critical section, V(S) is invoked – Note: mutex can be implemented as a binary semaphore
21
P(semaphore S) { if S.v > 0 then S.v := S.v-1; else { insert current thread in S.q; change its state to blocked; schedule another thread; } }
V(semaphore S) { if S.v = 0 and not empty(S.q) then { pick a thread T from S.q; change T’s state to ready; } else S.v := S.v+1; }
Suggested reading: http://www.mcs.drexel.edu/~shartley/OSusingSR/semaphores.html http://en.wikipedia.org/wiki/Semaphore_(programming)
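POSIX also provides counting semaphores directly through <semaphore.h>; a minimal usage sketch (not from the slides, with the worker function and names as illustrative assumptions) of a binary semaphore guarding a critical section:

#include <semaphore.h>
#include <pthread.h>
#include <stddef.h>

/* sem_wait() plays the role of P(S), sem_post() the role of V(S). */
sem_t sem;

void *worker(void *arg) {
    sem_wait(&sem);        /* P(S): blocks while the value is 0 */
    /* critical section */
    sem_post(&sem);        /* V(S): release, possibly waking a blocked thread */
    return NULL;
}

int main(void) {
    sem_init(&sem, 0, 1);  /* initial value 1 -> behaves like a mutex */
    /* ... create worker threads with pthread_create(), join them ... */
    sem_destroy(&sem);
    return 0;
}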
CSC 7600 Lecture 11 : Pthreads Spring 2011
Disadvantages of Locks
• Blocking mechanism (forces threads to wait) • Conservative (lock has to be acquired when there’s only a
possibility of access conflict) • Vulnerable to faults and failures (what if the owner of the lock
dies?) • Programming is difficult and error prone (deadlocks, starvation) • Does not scale with problem size and complexity • Require balancing the granularity of locked data against the cost
of fine-grain locks • Not composable • Suffer from priority inversion and convoying • Difficult to debug
22
Reference: http://en.wikipedia.org/wiki/Lock_(computer_science)
CSC 7600 Lecture 11 : Pthreads Spring 2011
23
Topics
• Introduction • Performance: CPI and memory behavior • Overview of threaded execution model • Programming with threads: basic concepts • Shared memory consistency models • Pitfalls of multithreaded programming • Thread implementations: approaches and issues • Pthreads: concepts and API • Summary
CSC 7600 Lecture 11 : Pthreads Spring 2011
24
Shared Memory Consistency Model
• Defines memory functionality related to read and write operations by multiple processors – Determines the order of read values in response to the order of
write values by multiple processors – Enables the writing of correct, efficient, and repeatable shared
memory programs • Establishes a formal discipline that places restrictions on
the values that can be returned by a read in a shared-memory program execution – Avoids non-determinacy in memory behavior – Provides a programmer perspective on expected behavior – Imposes demands on system memory operation
• Two general classes of consistency models: – Sequential consistency – Relaxed consistency
CSC 7600 Lecture 11 : Pthreads Spring 2011
25
Sequential Consistency Model
• Most widely adopted memory model • Required:
– Maintaining program order among operations from individual processors
– Maintaining a single sequential order among operations from all processors
• Enforces effect of atomic complex memory operations – Enables compound atomic operations – Avoids race conditions – Precludes non-determinacy from dueling processors
CSC 7600 Lecture 11 : Pthreads Spring 2011
26
Relaxed Consistency Models
• Sequential consistency over-constrains parallel execution limiting parallel performance and scalability – Critical sections impose sequential bottlenecks – Amdahl’s Law applies imposing upper bound on performance
• Relaxed consistency models permit optimizations not possible under limitations of sequential consistency
• Forms of relaxed consistency – Program order
• Write to read • Write to write • Read to following read or write
– Write atomicity • Read value of its own previous write prior to being visible to all
other processors
CSC 7600 Lecture 11 : Pthreads Spring 2011
27
Topics
• Introduction • Performance: CPI and memory behavior • Overview of threaded execution model • Programming with threads: basic concepts • Shared memory consistency models • Pitfalls of multithreaded programming • Thread implementations: approaches and issues • Pthreads: concepts and API • Summary
CSC 7600 Lecture 11 : Pthreads Spring 2011
Dining Philosophers Problem
28
Description: • N philosophers (N > 3) spend their time eating and thinking at the round table • There are N plates and N forks (or chopsticks, in some versions) between the plates • Eating requires two forks, which may be picked one at a time, at each side of the plate • When any of the philosophers is done eating, he starts thinking • When a philosopher becomes hungry, he attempts to start eating • They do it in complete silence as to not disturb each other (hence no communication to synchronize their actions is possible)
A variation on Edsger Dijkstra's five computers competing for access to five shared tape drives problem (introduced in 1971), retold by Tony Hoare.
Problem: How must they acquire/release forks to ensure that each of them maintains a healthy balance between meditation and eating?
CSC 7600 Lecture 11 : Pthreads Spring 2011
What Can Go Wrong at the Philosophers Table?
• Deadlock If all philosophers decide to eat at the same time and pick forks at the same side of their plates, they are stuck forever waiting for the second fork.
• Livelock Livelock frequently occurs as a consequence of a poorly thought out deadlock prevention strategy. Assume that all philosophers: (a) wait some length of time to put down the fork they hold after noticing that they are unable to acquire the second fork, and then (b) wait some amount of time to reacquire the forks. If they happen to get hungry at the same time, each picks up one fork as in the deadlock scenario, and all (a) and (b) timeouts are set to the same value, they won't be able to progress (even though there is no actual resource shortage).
• Starvation There may be at least one philosopher unable to acquire both forks due to timing issues. For example, his neighbors may alternately keep picking one of the forks just ahead of him and take advantage of the fact that he is forced to put down the only fork he was able to get hold of due to deadlock avoidance mechanism.
29
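One classic remedy for the deadlock scenario above (not given on the slide) is to impose a global ordering on the forks. A hedged Pthreads sketch, where names such as fork_mtx and eat are illustrative:

#include <pthread.h>

#define N 5
static pthread_mutex_t fork_mtx[N];   /* one mutex per fork */

/* Every philosopher always picks the lower-numbered fork first, so the
   circular wait that causes the deadlock cannot form: philosopher N-1
   grabs fork 0 before fork N-1, breaking the cycle. */
static void eat(int i) {
    int left = i, right = (i + 1) % N;
    int first  = left < right ? left : right;
    int second = left < right ? right : left;
    pthread_mutex_lock(&fork_mtx[first]);
    pthread_mutex_lock(&fork_mtx[second]);
    /* ... eat ... */
    pthread_mutex_unlock(&fork_mtx[second]);
    pthread_mutex_unlock(&fork_mtx[first]);
}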
CSC 7600 Lecture 11 : Pthreads Spring 2011
30
Priority Inversion
• How it happens: – A low priority thread locks the mutex for some shared resource – A high priority thread requires access to the same resource (waits for the
mutex) – In the meantime, a medium priority thread (not depending on the common
resource) gets scheduled, preempting the low priority thread and thus preventing it from releasing the mutex
• A classic occurrence of this phenomenon led to a system reset and subsequent loss of data in the Mars Pathfinder mission in 1997: http://research.microsoft.com/~mbj/Mars_Pathfinder/Mars_Pathfinder.html
Priority inversion is the scenario where a low priority thread holds a shared resource that is required by a high priority thread.
Suggested reading: http://en.wikipedia.org/wiki/Priority_inversion
CSC 7600 Lecture 11 : Pthreads Spring 2011
31
Spurious Wakeups
• Spurious wakeup is a phenomenon associated with a thread waiting on a condition variable
• In most cases, such a thread is supposed to return from call to wait() only if the condition variable has been signaled or broadcast
• Occasionally, the waiting thread gets unblocked unexpectedly, either due to thread implementation performance trade-offs, or scheduler deficiencies
• Lesson: upon exit from wait(), test the predicate to make sure the waiting thread indeed may proceed (i.e., the data it was waiting for have been provided). The side effect is a more robust code.
Suggested reading: http://en.wikipedia.org/wiki/Spurious_wakeup
CSC 7600 Lecture 11 : Pthreads Spring 2011
Thread Safety A code is thread-safe if it functions correctly during simultaneous execution by multiple threads.
• Indicators helpful in determining thread safety – How the code accesses global variables and heap – How it allocates and frees resources that have global limits – How it performs indirect accesses (through pointers or handles) – Are there any visible side effects
• Achieving thread safety
– Re-entrancy: property of code, which may be interrupted during execution of one task, reentered to perform another, and then resumed on its original task without undesirable effects
– Mutual exclusion: accesses to shared data are serialized to ensure that only one thread performs critical state update. Acquire locks in an identical order on all threads
– Thread-local storage: as much of the accessed data as possible should be placed in thread’s private variables
– Atomic operations: should be the preferred mechanism of use when operating on shared state
32
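A small sketch (an assumed example, not part of the slides) of the thread-local storage approach mentioned above, using the Pthreads key API; each thread gets its own copy of the data, so no locking is needed to access it.

#include <pthread.h>
#include <stdlib.h>

static pthread_key_t key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

static void make_key(void) {
    pthread_key_create(&key, free);   /* free() runs on the value at thread exit */
}

/* Returns a per-thread counter, allocating it on first use in each thread. */
static int *get_private_counter(void) {
    pthread_once(&key_once, make_key);
    int *p = pthread_getspecific(key);
    if (p == NULL) {
        p = calloc(1, sizeof *p);
        pthread_setspecific(key, p);
    }
    return p;
}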
CSC 7600 Lecture 11 : Pthreads Spring 2011
33
Topics
• Introduction • Performance: CPI and memory behavior • Overview of threaded execution model • Programming with threads: basic concepts • Shared memory consistency models • Pitfalls of multithreaded programming • Thread implementations: approaches and issues • Pthreads: concepts and API • Summary
CSC 7600 Lecture 11 : Pthreads Spring 2011
Common Approaches to Thread Implementation
• Kernel threads • User-space threads • Hybrid implementations
34
References: 1. POSIX Threads on HP-UX 11i, http://devresource.hp.com/drc/resources/pthread_wp_jul2004.pdf 2. SunOS Multi-thread Architecture by M. L. Powell, S. R. Kleinman, et al. http://opensolaris.org/os/project/muskoka/doc_attic/mt_arch.pdf
CSC 7600 Lecture 11 : Pthreads Spring 2011
Kernel Threads
• Also referred to as Light Weight Processes • Known to and individually managed by the kernel • Can make system calls independently • Can run in parallel on a multiprocessor (map directly onto
available execution hardware) • Typically have wider range of scheduling capabilities • Support preemptive multithreading natively • Require kernel support and resources • Have higher management overhead
35
CSC 7600 Lecture 11 : Pthreads Spring 2011
User-space Threads
• Also known as fibers or coroutines • Operate on top of kernel threads, mapped to them via user-space
scheduler • Thread manipulations (“context switches”, etc.) are performed entirely
in user space • Usually scheduled cooperatively (i.e., non-preemptively), complicating
the application code due to inclusion of explicit processor yield statements
• Context switches cost less (on the order of subroutine invocation) • Consume less resources than kernel threads; their number can be
consequently much higher without imposing significant overhead • Blocking system calls present a challenge and may lead to inefficient
processor usage (user-space scheduler is ignorant of the occurrence of blocking; no notification mechanism exists in kernel either)
36
CSC 7600 Lecture 11 : Pthreads Spring 2011
MxN Threading
• Available on NetBSD, HP-UX and Solaris to complement the existing 1x1 (kernel threads only) and Mx1 (multiplexed user threads) libraries
• Multiplex M lightweight user-space threads on top of N kernel threads, M > N (sometimes M >> N)
• User threads are unbound and scheduled on Virtual Processors (which in turn execute on kernel threads); user thread may effectively move from one kernel thread to another in its lifetime
• In some implementations Virtual Processors rely on the concept of Scheduler Activations to deal with the issue of user-space threads blocking during system calls
37
CSC 7600 Lecture 11 : Pthreads Spring 2011
38
Scheduler Activations • Developed in 1991 at the University of Washington • Typically used in implementations involving user-space threads • Require kernel cooperation in form of a lightweight upcall mechanism to
communicate blocking and unblocking events to the user-space scheduler
Reference: T. Anderson, B. Bershad, E. Lazowska and H. Levy, Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism, http://www.cs.washington.edu/homes/bershad/Papers/p53-anderson.pdf
– Unbound user threads are scheduled on Virtual Processors (which in turn execute on kernel threads) – A user thread may effectively move from one kernel thread to another in its lifetime – Scheduler Activation resembles and is scheduled like a kernel thread – Scheduler Activation provides its replacement to the user-space scheduler when the unbound thread invokes a blocking operation in the kernel – The new Scheduler Activation continues the operations of the same VP
CSC 7600 Lecture 11 : Pthreads Spring 2011
39
Examples of Multi-Threaded System Implementations
• The most commonly used thread package on Linux is Native POSIX Thread Library (NPTL)
– Requires kernel version 2.6 – 1x1 model, mapping each application thread to a kernel thread – Bundled by default with recent versions of glibc – High-performance implementation – POSIX (Pthreads) compliant
• Most of the prominent operating systems feature their own thread implementations, for example:
– FreeBSD: three thread libraries, each supporting different execution model (user-space, 1x1, MxN with scheduler activations)
– Solaris: kernel-level execution through LWPs (Lightweight Processes); user threads execute in context of LWPs and are controlled by system library
– HPUX: Pthreads compliant MxN implementation – MS Windows: threads as smallest kernel-level execution objects, fibers as smallest user-
level execution objects controlled by the programmer; many-to-many scheduling supported • There are numerous open-source thread libraries (mostly for Linux): LinuxThreads,
GNU Pth, Bare-Bone Threads, FSU Pthreads, DCEthreads, Nthreads, CLthreads, PCthreads, LWP, QuickThreads, Marcel, etc.
CSC 7600 Lecture 11 : Pthreads Spring 2011
40
Topics
• Introduction • Performance: CPI and memory behavior • Overview of threaded execution model • Programming with threads: basic concepts • Shared memory consistency models • Pitfalls of multithreaded programming • Thread implementations: approaches and issues • Pthreads: concepts and API • Summary
CSC 7600 Lecture 11 : Pthreads Spring 2011
POSIX Threads (Pthreads) • POSIX Threads define POSIX standard for multithreaded API (IEEE POSIX
1003.1-1995) • The functions comprising core functionality of Pthreads can be divided into
three classes: – Thread management – Mutexes – Condition variables
• Pthreads define the interface using C language types, function prototypes and macros
• Naming conventions for identifiers: – pthread_: Threads themselves and miscellaneous subroutines – pthread_attr_: Thread attributes objects – pthread_mutex_: Mutexes – pthread_mutexattr_: Mutex attributes objects – pthread_cond_: Condition variables – pthread_condattr_: Condition attributes objects – pthread_key_: Thread-specific data keys
41
References: 1. http://www.llnl.gov/computing/tutorials/pthreads/ 2. http://www.opengroup.org/onlinepubs/007908799/xsh/pthread.h.html
CSC 7600 Lecture 11 : Pthreads Spring 2011
Programming with Pthreads The scope of this short tutorial is: • General thread management • Synchronization
– Mutexes – Condition variables
• Miscellaneous functions
42
CSC 7600 Lecture 11 : Pthreads Spring 2011
43
Pthreads: Thread Creation
Function: pthread_create()
int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*routine)(void *), void *arg); Description: Creates a new thread within a process. The created thread starts execution of routine, which is passed a pointer argument arg. The attributes of the new thread can be specified through attr, or left at default values if attr is null. Successful call returns 0 and stores the id of the new thread in location pointed to by thread, otherwise an error code is returned.
#include <pthread.h> ... void *do_work(void *input_data) { /* this is thread’s starting routine */ ... } ... pthread_t id; struct {. . .} args = {. . .}; /* struct containing thread arguments */ int err; ... /* create new thread with default attributes */ err = pthread_create(&id, NULL, do_work, (void *)&args); if (err != 0) {/* handle thread creation failure */} ...
CSC 7600 Lecture 11 : Pthreads Spring 2011
44
Pthreads: Thread Join
Function: pthread_join()
int pthread_join(pthread_t thread, void **value_ptr);
Description: Suspends the execution of the calling thread until the target thread terminates (either by returning from its startup routine, or calling pthread_exit()), unless the target thread already terminated. If value_ptr is not null, the return value from the target thread or argument passed to pthread_exit() is made available in location pointed to by value_ptr. When pthread_join() returns successfully (i.e. with zero return code), the target thread has been terminated.
#include <pthread.h> ... void *do_work(void *args) {/* workload to be executed by thread */} ... void *result_ptr; int err; ... /* create worker thread */ pthread_create(&id, NULL, do_work, (void *)&args); ... err = pthread_join(id, &result_ptr); if (err != 0) {/* handle join error */} else {/* the worker thread is terminated and result_ptr points to its return value */ ... }
CSC 7600 Lecture 11 : Pthreads Spring 2011
45
Pthreads: Thread Exit Function: pthread_exit()
void pthread_exit(void *value_ptr);
Description: Terminates the calling thread and makes the value_ptr available to any successful join with the terminating thread. Performs cleanup of local thread environment by calling cancellation handlers and data destructor functions. Thread termination does not release any application visible resources, such as mutexes and file descriptors, nor does it perform any process-level cleanup actions.
#include <pthread.h> ... void *do_work(void *args) { ... pthread_exit(&return_value); /* the code following pthread_exit is not executed */ ... } ... void *result_ptr; pthread_t id; pthread_create(&id, NULL, do_work, (void *)&args); ... pthread_join(id, &result); /* result_ptr now points to return_value */ ...
CSC 7600 Lecture 11 : Pthreads Spring 2011
46
Pthreads: Thread Termination
Function: pthread_cancel()
int pthread_cancel(pthread_t thread);
Description: The pthread_cancel() requests cancellation of thread thread. The ability to cancel a thread is dependent on its state and type.
#include <pthread.h> ... void *do_work(void *args) {/* workload to be executed by thread */} ... pthread_t id; int err; pthread_create(&id, NULL, do_work, (void *)&args); ... err = pthread_cancel(id); if (err != 0) {/* handle cancelation failure */} ...
CSC 7600 Lecture 11 : Pthreads Spring 2011
47
Pthreads: Detached Threads
Function: pthread_detach()
int pthread_detach(pthread_t thread);
Description: Indicates to the implementation that storage for thread thread can be reclaimed when the thread terminates. If the thread has not terminated, pthread_detach() is not going to cause it to terminate. Returns zero on success, error number otherwise.
#include <pthread.h> ... void *do_work(void *args) {/* workload to be executed by thread */} ... pthread_t id; int err; ... /* start a new thread */ pthread_create(&id, NULL, do_work, (void *)&args); ... err = pthread_detach(id); if (err != 0) {/* handle detachment failure */} else {/* master thread doesn’t join the worker thread; the worker thread resources will be released automatically after it terminates */ ... }
CSC 7600 Lecture 11 : Pthreads Spring 2011
48
Pthreads: Operations on Mutex Objects (I)
#include <pthread.h> ... pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; ... /* lock the mutex before entering critical section */ pthread_mutex_lock(&mutex); /* critical section code */ ... /* leave critical section and release the mutex */ pthread_mutex_unlock(&mutex); ...
Function: pthread_mutex_lock(), pthread_mutex_unlock()
int pthread_mutex_lock(pthread_mutex_t *mutex); int pthread_mutex_unlock(pthread_mutex_t *mutex); Description: The mutex object referenced by mutex shall be locked by calling pthread_mutex_lock(). If the mutex is already locked, the calling thread blocks until the mutex becomes available. After successful return from the call, the mutex object referenced by mutex is in locked state with the calling thread as its owner. The mutex object referenced by mutex is released by calling pthread_mutex_unlock(). If there are threads blocked on the mutex, scheduling policy decides which of them shall acquire the released mutex.
CSC 7600 Lecture 11 : Pthreads Spring 2011
49
Pthreads: Operations on Mutex Objects (II)
#include <pthread.h> ... pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; int err; ... /* attempt to lock the mutex */ err = pthread_mutex_trylock(&mutex); switch (err) { case 0: /* lock acquired; execute critical section code and release mutex */ ... pthread_mutex_unlock(&mutex); break; case EBUSY: /* someone already owns the mutex; do something else instead of blocking */ ... break; default: /* some other failure */ ... break; }
Function: pthread_mutex_trylock()
int pthread_mutex_trylock(pthread_mutex_t *mutex);
Description: The function pthread_mutex_trylock() is equivalent to pthread_mutex_lock() , except that if the mutex object is currently locked, the call returns immediately with an error code EBUSY. The value of 0 (success) is returned only if the mutex has been acquired.
CSC 7600 Lecture 11 : Pthreads Spring 2011
Pthread Mutex Types
• Normal – No deadlock detection on attempts to relock already locked mutex
• Error-checking – Error returned when locking a locked mutex
• Recursive – Maintains lock count variable – After the first acquisition of the mutex, the lock count is set to one – After each successful relock, the lock count is increased; after each
unlock, it is decremented – When the lock count drops to zero, thread loses the mutex
ownership • Default
– Attempts to lock the mutex recursively result in an undefined behavior
– Attempts to unlock the mutex which is not locked, or was not locked by the calling thread, results in undefined behavior
50
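A brief sketch (illustrative, not part of the original tutorial) showing how the mutex type is selected through a mutex attributes object; here a recursive mutex, which keeps the lock count described above.

#include <pthread.h>

pthread_mutex_t m;

void init_recursive_mutex(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&m, &attr);      /* m may now be relocked by its owner */
    pthread_mutexattr_destroy(&attr);   /* attributes are copied into the mutex */
}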
CSC 7600 Lecture 11 : Pthreads Spring 2011
51
Pthreads: Condition Variables Function: pthread_cond_wait(),
pthread_cond_signal(), pthread_cond_broadcast() int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex); int pthread_cond_signal(pthread_cond_t *cond); int pthread_cond_broadcast(pthread_cond_t *cond); Description: The pthread_cond_wait() blocks on a condition variable associated with a mutex. The function must be called with a locked mutex argument. It atomically releases the mutex and causes the calling thread to block. While in that state, another thread is permitted to access the mutex. Subsequent mutex release should be announced by the accessing thread through pthread_cond_signal() or pthread_cond_broadcast(). Upon successful return from pthread_cond_wait(), the mutex is in locked state with the calling thread as its owner. The pthread_cond_signal() unblocks at least one of the threads that are blocked on the specified condition variable cond. The pthread_cond_broadcast() unblocks all threads currently blocked on the specified condition variable cond. All of these functions return zero on successful completion, or an error code otherwise.
CSC 7600 Lecture 11 : Pthreads Spring 2011
52
Example: Condition Variable
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; /* create default mutex */ pthread_cond_t cond = PTHREAD_COND_INITIALIZER; /* create default condition variable */ pthread_t prod_id, cons_id; item_t buffer; /* storage buffer (shared access) */ int empty = 1; /* buffer empty flag (shared access) */ ... pthread_create(&prod_id, NULL, producer, NULL); /* start producer thread */ pthread_create(&cons_id, NULL, consumer, NULL); /* start consumer thread */ ...
void *producer(void *none) { while (1) { /* obtain next item, asynchronously */ item_t item = compute_item(); pthread_mutex_lock(&mutex); /* critical section starts here */ while (!empty) /* wait until buffer is empty */ pthread_cond_wait(&cond, &mutex); /* store item, update status */ buffer = item; empty = 0; /* wake waiting consumer (if any) */ pthread_cond_signal(&cond); /* critical section done */ pthread_mutex_unlock(&mutex); } }
void *consumer(void *none) { while (1) { item_t item; pthread_mutex_lock(&mutex); /* critical section starts here */ while (empty) /* block (nothing in buffer yet) */ pthread_cond_wait(&cond, &mutex); /* grab item, update buffer status */ item = buffer; empty = 1; /* critical section done */ pthread_cond_signal(&cond); pthread_mutex_unlock(&mutex); /* process item, asynchronously */ consume_item(item); } }
Initialization and startup
Simple producer thread Simple consumer thread
CSC 7600 Lecture 11 : Pthreads Spring 2011
53
Pthreads: Dynamic Initialization
Function: pthread_once()
int pthread_once(pthread_once_t *control, void (*init_routine)(void));
Description: The first call to pthread_once() by any thread in a process will call the init_routine() with no arguments. Subsequent calls to pthread_once() with the same control will not call init_routine().
#include <pthread.h> ... pthread_once_t init_ctrl = PTHREAD_ONCE_INIT; ... void initialize() {/* initialize global variables */} ... void *do_work(void *arg) { /* make sure global environment is set up */ pthread_once(&init_ctrl, initialize); /* start computations */ ... } ... pthread_t id; pthread_create(&id, NULL, do_work, NULL); ...
CSC 7600 Lecture 11 : Pthreads Spring 2011
54
Pthreads: Get Thread ID
Function: pthread_self()
pthread_t pthread_self(void);
Description: Returns the thread ID of the calling thread.
#include <pthread.h> ... pthread_t id; id = pthread_self(); ...
CSC 7600 Lecture 11 : Pthreads Spring 2011
55
Topics
• Introduction • Performance: CPI and memory behavior • Overview of threaded execution model • Programming with threads: basic concepts • Shared memory consistency models • Pitfalls of multithreaded programming • Thread implementations: approaches and issues • Pthreads: concepts and API • Summary
CSC 7600 Lecture 11 : Pthreads Spring 2011
Summary – Material for the Test
• Performance & cpi: slide 8 • Multi thread concepts: 13, 16, 18, 19, 22, 24, 31 • Thread implementations: 35 – 37 • Pthreads: 43 – 45, 48
56
CSC 7600 Lecture 11 : Pthreads Spring 2011
57
CSC 7600 Lecture 12 : OpenMPSpring 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &
MEANS
OPENMP
Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
February 24, 2011
CSC 7600 Lecture 12 : OpenMPSpring 2011
Topics
• Review of HPC Models
• Shared Memory: Performance concepts
• Introduction to OpenMP
• OpenMP: Runtime Library & Environment Variables
• OpenMP: Data & Work sharing directives
• OpenMP: Synchronization
• OpenMP: Reduction
• Synopsis of Commands
• Summary Materials for Test
2
CSC 7600 Lecture 12 : OpenMPSpring 2011
Topics
• Review of HPC Models
• Shared Memory: Performance concepts
• Introduction to OpenMP
• OpenMP: Runtime Library & Environment Variables
• OpenMP: Data & Work sharing directives
• OpenMP: Synchronization
• OpenMP: Reduction
• Synopsis of Commands
• Summary Materials for Test
3
CSC 7600 Lecture 12 : OpenMPSpring 2011
Where are we? (Take a deep breath …)
• 3 classes of parallel/distributed computing
– Capacity
– Capability
– Cooperative
• 3 classes of parallel architectures (respectively)
– Loosely coupled clusters and workstation farms
– Tightly coupled vector, SIMD, SMP
– Distributed memory MPPs (and some clusters)
• 3 classes of parallel execution models (respectively)
– Workflow, throughput, SPMD (ssh)
– Multithreaded with shared memory semantics (Pthreads)
– Communicating Sequential Processes (sockets)
• 3 classes of programming models
– Condor (Segment 1)
– OpenMP (Segment 3)
– MPI (Segment 2)
You Are Here
4
CSC 7600 Lecture 12 : OpenMPSpring 2011
HPC Modalities
5
Modalities Degree of Integration
Architectures Execution Models
Programming Models
Capacity Loosely Coupled Clusters & Workstation farms
Workflow Throughput
Condor
Capability Tightly Coupled Vectors, SMP, SIMD
Shared Memory Multithreading
OpenMP
Cooperative Medium DM MPPs & Clusters
CSP MPI
CSC 7600 Lecture 12 : OpenMPSpring 2011
Topics
• Review of HPC Models
• Shared Memory: Performance concepts
• Introduction to OpenMP
• OpenMP: Runtime Library & Environment Variables
• OpenMP: Data & Work sharing directives
• OpenMP: Synchronization
• OpenMP: Reduction
• Synopsis of Commands
• Summary Materials for Test
6
CSC 7600 Lecture 12 : OpenMPSpring 2011
7
Amdahl’s Law
T_A = (1 - f) × T_O + (f / g) × T_O
S = T_O / T_A = T_O / [(1 - f) × T_O + (f / g) × T_O]
S = 1 / [(1 - f) + f / g]

where T_O ≡ time for non-accelerated computation, T_A ≡ time for accelerated computation, T_F ≡ time of the portion of computation that can be accelerated, g ≡ peak performance gain for the accelerated portion, f ≡ fraction of non-accelerated computation to be accelerated, and S ≡ speed up of computation with acceleration applied.

[Timeline diagram: the original run of length T_O contains an accelerable portion T_F; in the accelerated run of length T_A that portion shrinks to T_F / g.]
CSC 7600 Lecture 12 : OpenMPSpring 2011
Performance : Caches & Locality
• Temporal Locality is a property that if a program accesses a
memory location, there is a much higher than random probability
that the same location would be accessed again.
• Spatial Locality is a property that if a program accesses a
memory location, there is a much higher than random probability
that the nearby locations would be accessed soon.
• Spatial locality is usually easier to achieve than temporal locality
• A couple of key factors affect the relationship between locality
and scheduling :
– Size of dataset being processed by each processor
– How much reuse is present in the code processing a chunk of
iterations.
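As a minimal illustration (not from the original slides), loop order in a C array traversal determines spatial locality: row-major order walks memory contiguously, while column-major order strides across it.

#define N 1024

/* Illustrative sketch: both loops compute the same sum, but the first
 * touches a[i][j] in the order the rows are laid out in memory (good
 * spatial locality), while the second jumps N doubles between accesses. */
double sum_row_major(double a[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];          /* consecutive addresses: cache friendly */
    return s;
}

double sum_col_major(double a[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];          /* stride-N accesses: poor spatial locality */
    return s;
}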
8
CSC 7600 Lecture 12 : OpenMPSpring 2011
Performance Shared Memory (OpenMP): Key
Factors
• Load Balancing :
– mapping workloads with thread scheduling
• Caches :
– Write-through
– Write-back
• Locality :
– Temporal Locality
– Spatial Locality
• How Locality affects scheduling algorithm selection
• Synchronization :
– Effect of critical sections on performance
9
CSC 7600 Lecture 12 : OpenMPSpring 2011
Performance : Caches & Locality
• Caches (Review) :
– for a C statement :
• a[i] = b[i]+c[i]
– the system accesses the memory locations referenced by b[i] and c[i] to the
processor, the result of the computation is subsequently stored in the memory
location referenced by a[i]
• Write-through caches: When a user writes some data, the data is immediately
written back to the memory, thus maintaining the cache-memory consistency.
In write through caches data in caches always reflect the data in the memory.
One of the main issues in write through caches is the increase in system
overhead required due to moving of large data between cache and memory.
• Write-back caches : When a user writes some data, the data is stored in the
cache and is not synchronized with the memory. Instead when the cache
content is different than the memory content, a bit entry is made in the cache.
While cleaning up caches the system checks for the entry in cache and if the
bit is set the system writes the changes to the memory.
10
CSC 7600 Lecture 12 : OpenMPSpring 2011
Topics
• Review of HPC Models
• Shared Memory: Performance concepts
• Introduction to OpenMP
• OpenMP: Runtime Library & Environment Variables
• OpenMP: Data & Work sharing directives
• OpenMP: Synchronization
• OpenMP: Reduction
• Synopsis of Commands
• Summary Materials for Test
11
CSC 7600 Lecture 12 : OpenMPSpring 2011
Introduction
• OpenMP is:
  – an API (Application Programming Interface)
  – NOT a programming language
  – a set of compiler directives that help the application developer parallelize their workload
  – a collection of directives, environment variables and library routines
• OpenMP is composed of the following main components:
  – Directives
– Runtime library routines
– Environment variables
12
CSC 7600 Lecture 12 : OpenMPSpring 2011
Components of OpenMP
Environment variables
Number of threads
Scheduling type
Dynamic thread adjustment
Nested Parallelism
13
Directives
Parallel regions
Work sharing
Synchronization
Data scope attributes:
• private
• firstprivate
• lastprivate
• shared
• reduction
Orphaning
Runtime library routines
Number of threads
Thread ID
Dynamic thread adjustment
Nested Parallelism
Timers
API for locking
CSC 7600 Lecture 12 : OpenMPSpring 2011
OpenMP Architecture
14
[Layered diagram: the User's Application is expressed with Compiler Directives and tuned with Environment Variables; both drive the OpenMP Runtime Library, which runs on top of Operating System level Threads.]
Inspired by OpenMP.org introductory slides
CSC 7600 Lecture 12 : OpenMPSpring 2011
Topics
• Review of HPC Models
• Shared Memory: Performance concepts
• Introduction to OpenMP
• OpenMP: Runtime Library & Environment Variables
• OpenMP: Data & Work sharing directives
• OpenMP: Synchronization
• OpenMP: Reduction
• Synopsis of Commands
• Summary Materials for Test
15
CSC 7600 Lecture 12 : OpenMPSpring 2011
Runtime Library Routines
• Runtime library routines help manage parallel programs
• Many runtime library routines have corresponding environment
variables that can be controlled by the users
• Runtime libraries can be accessed by including omp.h in
applications that use OpenMP : #include <omp.h>
• For example, calls like:
  – omp_get_num_threads() (by which an OpenMP program determines the number of
    threads available for execution) can be controlled using an environment
    variable set at the command line of a shell ($OMP_NUM_THREADS)
• Some of the activities that the OpenMP libraries help manage are :
– Determining the number of threads/processors
– Scheduling policies to be used
– General purpose locking and portable wall clock timing routines
16
CSC 7600 Lecture 12 : OpenMPSpring 2011
17
OpenMP : Runtime Library
Function: omp_get_num_threads()
C/ C++ int omp_get_num_threads(void);
Fortran integer function omp_get_num_threads()
Description:
Returns the total number of threads currently in the group executing the parallel
block from where it is called.
Function: omp_get_thread_num()
C/ C++ int omp_get_thread_num(void);
Fortran integer function omp_get_thread_num()
Description:
For the master thread, this function returns zero. For the other (worker) threads the call returns
a unique integer between 1 and omp_get_num_threads()-1, inclusive.
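A minimal sketch (not from the original slides) of how these two calls behave inside and outside a parallel region:

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* Outside a parallel region there is only one thread in the team */
    printf("serial part: %d thread(s)\n", omp_get_num_threads());

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();        /* 0 .. team size - 1 */
        int nthreads = omp_get_num_threads();  /* size of the current team */
        printf("thread %d of %d\n", tid, nthreads);
    }
    return 0;
}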
CSC 7600 Lecture 12 : OpenMPSpring 2011
OpenMP Environment Variables
• OpenMP provides 4 main environment variables for
controlling execution of parallel codes:
OMP_NUM_THREADS – controls the parallelism of the
OpenMP application
OMP_DYNAMIC – enables dynamic adjustment of number of
threads for execution of parallel regions
OMP_SCHEDULE – controls the load distribution in loops such
as do, for
OMP_NESTED – Enables nested parallelism in OpenMP
applications
18
CSC 7600 Lecture 12 : OpenMPSpring 2011
OpenMP Environment Variables
19
Environment Variable: OMP_NUM_THREADS
Syntax: OMP_NUM_THREADS n
Usage:
  bash/sh/ksh:  export OMP_NUM_THREADS=8
  csh/tcsh:     setenv OMP_NUM_THREADS 8
Description:
Sets the number of threads to be used by the OpenMP program during execution.

Environment Variable: OMP_DYNAMIC
Syntax: OMP_DYNAMIC {TRUE|FALSE}
Usage:
  bash/sh/ksh:  export OMP_DYNAMIC=TRUE
  csh/tcsh:     setenv OMP_DYNAMIC TRUE
Description:
When this environment variable is set to TRUE, the runtime may dynamically adjust the number of
threads used for parallel regions; $OMP_NUM_THREADS acts as the upper bound.
CSC 7600 Lecture 12 : OpenMPSpring 2011
OpenMP Environment Variables
20
Environment Variable: OMP_SCHEDULE
Syntax: OMP_SCHEDULE "schedule[,chunk]"
Usage:
  bash/sh/ksh:  export OMP_SCHEDULE="static,N/P"
  csh/tcsh:     setenv OMP_SCHEDULE "GUIDED,4"
Description:
Only applies to for and parallel for directives. This environment variable sets the
schedule type and chunk size for all such loops. The chunk size can be provided as an
integer number, the default being 1.

Environment Variable: OMP_NESTED
Syntax: OMP_NESTED {TRUE|FALSE}
Usage:
  bash/sh/ksh:  export OMP_NESTED=FALSE
  csh/tcsh:     setenv OMP_NESTED FALSE
Description:
Setting this environment variable to TRUE enables multi-threaded execution of inner
parallel regions in nested parallel regions.
CSC 7600 Lecture 12 : OpenMPSpring 2011
OpenMP : Basic Constructs
C / C++ :

#pragma omp parallel
{
  parallel block
} /* omp end parallel */
21
OpenMP Execution Model (FORK/JOIN):
Sequential Part (master thread)
Parallel Region (FORK : group of threads)
Sequential Part (JOIN: master thread)
Parallel Region (FORK: group of threads)
Sequential Part (JOIN : master thread)
To invoke library routines in C/C++ add
#include <omp.h> near the top of your code
CSC 7600 Lecture 12 : OpenMPSpring 2011
HelloWorld in OpenMP
22
#include <omp.h>
#include <stdio.h>

main () {
  int nthreads, tid;
  #pragma omp parallel private(nthreads, tid)
  {
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  }
}
Code segment that will be executed in parallel
OpenMP directive to indicate START segment to be parallelized
OpenMP directive to indicate END segment to be parallelized
Non shared copies of data for each thread
CSC 7600 Lecture 12 : OpenMPSpring 2011
OpenMP Execution
• On encountering the C construct #pragma omp parallel {, n-1 extra threads are created (giving a team of n threads)
• omp_get_thread_num() returns a unique identifier for each thread that can be utilized. The value returned by this call is between 0 and (OMP_NUM_THREADS – 1)
• omp_get_num_threads() returns the total number of threads involved in the parallel section of the program
• The code inside the parallel region is executed independently by each of the n threads.
• The closing brace } (corresponding to #pragma omp parallel {) marks the end of parallel execution of the code segment; the n-1 extra threads are deactivated and normal sequential execution resumes.
23
CSC 7600 Lecture 12 : OpenMPSpring 2011
Compiling OpenMP Programs
Fortran :
• Case insensitive directives
• Syntax :
– !$OMP directive [clause[[,] clause]…] (free format)
– !$OMP / C$OMP / *$OMP directive [clause[[,] clause]…] (fixed format)
• Compiling OpenMP source code :
– (GNU Fortran compiler) : gfortran -fopenmp -o exec_name file_name.f95
– (Intel Fortran compiler) : ifort -o exe_file_name -openmp file_name.f
24
C :
• Case sensitive directives
• Syntax :
– #pragma omp directive [clause [clause]..]
• Compiling OpenMP source code :
– (GNU C compiler) : gcc -fopenmp -o exec_name file_name.c
– (Intel C compiler) : icc -o exe_file_name -openmp file_name.c
CSC 7600 Lecture 12 : OpenMPSpring 2011
Topics
• Review of HPC Models
• Shared Memory: Performance concepts
• Introduction to OpenMP
• OpenMP: Runtime Library & Environment Variables
• OpenMP: Data & Work sharing directives
• OpenMP: Synchronization
• OpenMP: Reduction
• Synopsis of Commands
• Summary Materials for Test
26
CSC 7600 Lecture 12 : OpenMPSpring 2011
OpenMP : Data Environment
• OpenMP program always begins with a single thread of control – master
thread
• Context associated with the master thread is also known as the Data
Environment.
• Context is comprised of :
– Global variables
– Automatic variables
– Dynamically allocated variables
• Context of the master thread remains valid throughout the execution of the
program
• The OpenMP parallel construct may be used to either share a single copy of
the context with all the threads or provide each of the threads with a private
copy of the context.
• The sharing of Context can be performed at various levels of granularity
– Selected variables from a context can be shared while keeping the rest of the context private,
etc.
27
CSC 7600 Lecture 12 : OpenMPSpring 2011
OpenMP Data Environment
• OpenMP data scoping clauses allow a programmer to decide a variable’s execution context (should a variable be shared or private.)
• 3 main data scoping clauses in OpenMP (Shared, Private, Reduction) :
• Shared :
– A variable will have a single storage location in memory for the duration of the parallel construct, i.e. references to a variable by different threads access the same memory location.
– That part of the memory is shared among the threads involved, hence modifications to the variable can be made using simple read/write operations
– Modifications to the variable by different threads is managed by underlying shared memory mechanisms
• Private :
– A variable will have a separate storage location in memory for each of the threads involved for the duration of the parallel construct.
– All read/write operations by the thread will affect the thread’s private copy of the variable .
• Reduction :
– Exhibit both shared and private storage behavior. Usually used on objects that are the target of arithmetic reduction.
– Example : summation of local variables at the end of a parallel construct
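A minimal sketch (not from the original slides) showing the three scoping clauses on one loop; the variable names are illustrative:

#include <omp.h>
#include <stdio.h>

int main(void) {
    int n = 100;            /* shared: one copy, read by every thread          */
    int scratch = 0;        /* private: each thread gets its own copy          */
    int total = 0;          /* reduction: private partial sums, combined later */

    #pragma omp parallel for shared(n) private(scratch) reduction(+:total)
    for (int i = 0; i < n; i++) {
        scratch = i + 1;    /* writes go to the thread-private copy            */
        total += scratch;   /* partial sums are combined into the shared total */
    }
    printf("total = %d\n", total);   /* prints 5050 */
    return 0;
}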
28
CSC 7600 Lecture 12 : OpenMPSpring 2011
OpenMP Work-Sharing Directives
• Work sharing constructs divide the execution of the
enclosed block of code among the group of threads.
• They do not launch new threads.
• No implied barrier on entry
• Implicit barrier at the end of work-sharing construct
• Commonly used Work Sharing constructs :
– for directive (C/C++ ; equivalent DO construct available in
Fortran but will not be covered here) : shares iterations of a
loop across a group of threads
– sections directive : breaks work into separate sections
between the group of threads; such that each thread
independently executes a section of the work.
– critical directive: serializes a section of code
29
CSC 7600 Lecture 12 : OpenMPSpring 2011
OpenMP schedule clause
• The schedule clause defines how the iterations of a loop are divided
among a group of threads
• static : iterations are divided into pieces of size chunk and are
statically assigned to each of the threads in a round robin fashion
• dynamic : iterations divided into pieces of size chunk and
dynamically assigned to a group of threads. After a thread finishes
processing a chunk, it is dynamically assigned the next set of
iterations.
• guided : for a chunk size of 1, the size of each chunk is
proportional to the number of unassigned iterations divided by the
number of threads, decreasing to 1. For a chunk size of k, the
same algorithm is used to determine the chunk size, with the
constraint that no chunk may contain fewer than k iterations (except
possibly the last chunk).
• Default schedule is implementation specific while the default chunk
size is usually 1
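A minimal sketch (not from the original slides) showing the three schedule kinds on the same loop; the chunk size of 4 is arbitrary:

#include <omp.h>

void scale(double *x, int n, double a) {
    /* static: iterations are pre-assigned in round-robin chunks of 4 */
    #pragma omp parallel for schedule(static, 4)
    for (int i = 0; i < n; i++) x[i] *= a;

    /* dynamic: threads grab the next chunk of 4 as they finish one */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; i++) x[i] *= a;

    /* guided: chunk size starts large and shrinks toward 4 as iterations run out */
    #pragma omp parallel for schedule(guided, 4)
    for (int i = 0; i < n; i++) x[i] *= a;
}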
30
CSC 7600 Lecture 12 : OpenMPSpring 2011
OpenMP for directive
• for directive helps share iterations of a loop
between a group of threads
• If nowait is specified then the threads do not wait
for synchronization at the end of a parallel loop
• The schedule clause describes how iterations of
a loop are divided among the threads in the team
(discussed in detail in the next few slides)
31
#pragma omp parallel
{
p=5;
#pragma omp for
for (i=0; i<24; i++)
x[i]=y[i]+p*(i+3);
…
…
} /* omp end parallel */
[Diagram: fork into threads; each thread executes p=5 and a chunk of the loop (i=0..4, i=5..9, …, i=20..23), computing x[i]=y[i]+… for its chunk, then the threads join.]
CSC 7600 Lecture 12 : OpenMPSpring 2011
Simple Loop Parallelization
#pragma omp parallel for
for (i=0; i<n; i++)
z[i] = a*x[i] + y;
32
Master thread executing serial portion of the code
Master thread encounters parallel for loop and creates worker threads
Master and worker threads divide iterations of the for loop and execute them concurrently
Implicit barrier: wait for all threads to finish their executions
Master thread executing serial portion of the code resumes and slave threads are discarded
CSC 7600 Lecture 12 : OpenMPSpring 2011
Example: OpenMP work sharing
Constructs
33
#include <omp.h>
#include <stdio.h>
#define N 16

main () {
  int i, chunk;
  float a[N], b[N], c[N];
  for (i = 0; i < N; i++)
    a[i] = b[i] = i * 1.0;
  chunk = 4;
  printf("a[i] + b[i] = c[i] \n");
  #pragma omp parallel shared(a,b,c,chunk) private(i)
  {
    #pragma omp for schedule(dynamic,chunk) nowait
    for (i = 0; i < N; i++)
      c[i] = a[i] + b[i];
  } /* end of parallel section */
  for (i = 0; i < N; i++)
    printf(" %f + %f = %f \n", a[i], b[i], c[i]);
}
Initializing the vectors a[i], b[i]
Instructing the runtime environment that a, b, c, chunk are shared variables and i is a private variable
Load balancing the threads using a DYNAMIC policy where array is divided into chunks of 4 and assigned to the threads
The nowait ensures that the child threads do not synchronize once their work is completed
Modified from examples posted on: https://computing.llnl.gov/tutorials/openMP/
CSC 7600 Lecture 12 : OpenMPSpring 2011
DEMO : Work Sharing Constructs :
Shared / Private / Schedule
• Vector addition problem to be used
• Two vectors: a[i] + b[i] = c[i]

a[i] + b[i] = c[i]
0.000000 + 0.000000 = 0.000000
1.000000 + 1.000000 = 2.000000
2.000000 + 2.000000 = 4.000000
3.000000 + 3.000000 = 6.000000
4.000000 + 4.000000 = 8.000000
5.000000 + 5.000000 = 10.000000
6.000000 + 6.000000 = 12.000000
7.000000 + 7.000000 = 14.000000
8.000000 + 8.000000 = 16.000000
9.000000 + 9.000000 = 18.000000
10.000000 + 10.000000 = 20.000000
11.000000 + 11.000000 = 22.000000
12.000000 + 12.000000 = 24.000000
13.000000 + 13.000000 = 26.000000
14.000000 + 14.000000 = 28.000000
15.000000 + 15.000000 = 30.000000
34
CSC 7600 Lecture 12 : OpenMPSpring 2011
OpenMP sections directive
• The sections directive is a non-iterative work sharing construct.
• Independent sections of code are nested within a sections directive
• It distributes the enclosed sections of code among the threads in the team
• Code enclosed within a section directive is executed once, by one thread from the pool of threads
35
#pragma omp parallel private(p)
{
  #pragma omp sections
  {
    {
      a = …;
      b = …;
    }
    #pragma omp section
    {
      p = …;
      q = …;
    }
    #pragma omp section
    {
      x = …;
      y = …;
    }
  } /* omp end sections */
} /* omp end parallel */
[Diagram: fork; one thread executes a=, b=, another executes p=, q=, another executes x=, y=; then the threads join.]
CSC 7600 Lecture 12 : OpenMPSpring 2011
Understanding variables in OpenMP
• Shared variable z is modified by multiple threads
• Each iteration reads the scalar variables a and y
and the array element x[i]
• a,y,x can be read concurrently as their values
remain unchanged.
• Each iteration writes to a distinct element of z[i]
over the index range. Hence write operations can
be carried out concurrently with each iteration
writing to a distinct array index and memory
location
• The parallel for directive in OpenMP ensures that
the for loop index value (i in this case) is private to
each thread.
36
#pragma omp parallel for
for (i=0; i<n; i++)
z[i] = a*x[i] + y;
CSC 7600 Lecture 12 : OpenMPSpring 2011
Example : OpenMP Sections
37
#include <omp.h>
#define N 16

main () {
  int i;
  float a[N], b[N], c[N], d[N];
  for (i = 0; i < N; i++)
    a[i] = b[i] = i * 1.5;
  #pragma omp parallel shared(a,b,c,d) private(i)
  {
    #pragma omp sections nowait
    {
      #pragma omp section
      for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];
      #pragma omp section
      for (i = 0; i < N; i++)
        d[i] = a[i] * b[i];
    } /* end of sections */
  } /* end of parallel section */
  …
Section : that computes the sum of the 2 vectors
Section : that computes the product of the 2 vectors
Sections construct that encloses the section calls
Modified from examples posted on: https://computing.llnl.gov/tutorials/openMP/
CSC 7600 Lecture 12 : OpenMPSpring 2011
DEMO : OpenMP Sections
38
[LSU760000@n00 l12]$ ./sections a[i] b[i] a[i]+b[i] a[i]*b[i] 0.000000 0.000000 0.000000 0.000000 1.500000 1.500000 3.000000 2.250000 3.000000 3.000000 6.000000 9.000000 4.500000 4.500000 9.000000 20.250000 6.000000 6.000000 12.000000 36.000000 7.500000 7.500000 15.000000 56.250000 9.000000 9.000000 18.000000 81.000000 10.500000 10.500000 21.000000 110.250000 12.000000 12.000000 24.000000 144.000000 13.500000 13.500000 27.000000 182.250000 15.000000 15.000000 30.000000 225.000000 16.500000 16.500000 33.000000 272.250000 18.000000 18.000000 36.000000 324.000000 19.500000 19.500000 39.000000 380.250000 21.000000 21.000000 42.000000 441.000000 22.500000 22.500000 45.000000 506.250000
CSC 7600 Lecture 12 : OpenMPSpring 2011
Topics
• Review of HPC Models
• Shared Memory: Performance concepts
• Introduction to OpenMP
• OpenMP: Runtime Library & Environment Variables
• OpenMP: Data & Work sharing directives
• OpenMP: Synchronization
• OpenMP: Reduction
• Synopsis of Commands
• Summary Materials for Test
39
CSC 7600 Lecture 12 : OpenMPSpring 2011
Thread Synchronization
• “communication” mainly through read write operations on shared
variables
• Synchronization defines the mechanisms that help in coordinating
execution of multiple threads (that use a shared context) in a parallel
program.
• Without synchronization, multiple threads accessing shared memory
location may cause conflicts by :
– Simultaneously attempting to modify the same location
– One thread attempting to read a memory location while another thread is
updating the same location.
• Synchronization helps by providing explicit coordination between
multiple threads.
• Two main forms of synchronization :
– Implicit event synchronization
– Explicit synchronization – critical, master directives in OpenMP
40
CSC 7600 Lecture 12 : OpenMPSpring 2011
Basic Types of Synchronization
• Explicit Synchronization via mutual exclusion
– Controls access to the shared variable by providing a thread exclusive
access to the memory location for the duration of its construct.
– Critical directive of OpenMP provides mutual exclusion
• Event Synchronization
– Signals occurrence of an event across multiple threads.
– Barrier directives in OpenMP provide the simplest form of event
synchronization
– The barrier directive defines a point in a parallel program where each
thread waits for all other threads to arrive. This helps to ensure that all
threads have executed the same code in parallel upto the barrier.
– Once all threads arrive at the point, the threads can continue execution
past the barrier.
• Additional synchronization mechanisms available in OpenMP
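As an illustration of event synchronization (a sketch, not from the original slides), a barrier forces every thread to finish phase 1 before any thread starts phase 2:

#include <omp.h>

void two_phase(double *a, double *b, int n) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();

        /* phase 1: each thread fills its slice of a[] */
        for (int i = tid; i < n; i += nth)
            a[i] = i * 1.0;

        /* no thread continues until every thread has finished phase 1 */
        #pragma omp barrier

        /* phase 2: b[] may now safely read any element of a[] */
        for (int i = tid; i < n; i += nth)
            b[i] = a[(i + 1) % n];
    }
}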
41
CSC 7600 Lecture 12 : OpenMPSpring 2011
OpenMP Synchronization : master
• The master directive in OpenMP marks a block of code that gets
executed on a single thread.
• The rest of the threads in the group ignore the portion of code
marked by the master directive
• Example
#pragma omp master
  structured block
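A minimal sketch (not from the original slides): only the master thread (thread 0) prints the header, while every thread executes the statement that follows the master block.

#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        #pragma omp master
        {
            /* executed by thread 0 only; other threads skip it (no implied barrier) */
            printf("team size = %d\n", omp_get_num_threads());
        }
        printf("hello from thread %d\n", omp_get_thread_num());
    }
    return 0;
}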
42
Race Condition :
Two asynchronous threads access the same shared variable, at least one of them modifies the variable, and the sequence of operations is undefined. The result of these asynchronous operations depends on the detailed timing of the individual threads in the group.
CSC 7600 Lecture 12 : OpenMPSpring 2011
OpenMP critical directive :
Explicit Synchronization
• Race conditions can be avoided by controlling access to shared variables by allowing threads to have exclusive access to the variables
• Exclusive access to shared variables allows the thread to atomically perform read, modify and update operations on the variable.
• Mutual exclusion synchronization is provided by the critical directive of OpenMP
• Code block within the critical region defined by critical /end critical directives can be executed only by one thread at a time.
• Other threads in the group must wait until the current thread exits the critical region. Thus only one thread can manipulate values in the critical region.
43
[Diagram: fork/join, with the shaded box marking the critical region]

int x;
x = 0;
#pragma omp parallel shared(x)
{
  #pragma omp critical
  x = 2*x + 1;
} /* omp end parallel */
CSC 7600 Lecture 12 : OpenMPSpring 2011
Simple Example : critical
44
cnt = 0;
f = 7;
#pragma omp parallel
{
#pragma omp for
for (i=0;i<20;i++){
if(b[i] == 0){
#pragma omp critical
cnt ++;
} /* end if */
a[i]=b[i]+f*(i+1);
} /* end for */
} /* omp end parallel */
[Diagram: cnt=0, f=7; the 20 iterations are split across threads (i=0..4, 5..9, 10..14, 15..19); each thread tests b[i], increments cnt inside the critical section when the test succeeds, then computes a[i]=b[i]+…; the threads join at the end.]
CSC 7600 Lecture 12 : OpenMPSpring 2011
Topics
• Review of HPC Models
• Shared Memory: Performance concepts
• Introduction to OpenMP
• OpenMP: Runtime Library & Environment Variables
• OpenMP: Data & Work sharing directives
• OpenMP: Synchronization
• OpenMP: Reduction
• Synopsis of Commands
• Summary Materials for Test
45
CSC 7600 Lecture 12 : OpenMPSpring 2011
OpenMP : Reduction
• The reduction clause performs a reduction on the shared variables in its list, based on the operator provided.
• for C/C++ operator can be any one of :
– +, *, -, ^, |, ||, & or &&
– At the end of a reduction, the shared variable contains the result obtained upon
combination of the list of variables processed using the operator specified.
46
sum = 0.0
#pragma omp parallel for reduction(+:sum)
for (i=0; i < 20; i++)
sum = sum + (a[i] * b[i]);
[Diagram: sum=0; the 20 iterations are split across threads (i=0..4, 5..9, 10..14, 15..19); each thread accumulates a private sum, and the private sums are combined (∑) into the shared sum at the end.]
CSC 7600 Lecture 12 : OpenMPSpring 2011
Example: Reduction
47
#include <omp.h>
#include <stdio.h>

main () {
  int i, n, chunk;
  float a[16], b[16], result;
  n = 16;
  chunk = 4;
  result = 0.0;
  for (i = 0; i < n; i++) {
    a[i] = i * 1.0;
    b[i] = i * 2.0;
  }
  #pragma omp parallel for default(shared) private(i) \
      schedule(static,chunk) reduction(+:result)
  for (i = 0; i < n; i++)
    result = result + (a[i] * b[i]);
  printf("Final result= %f\n", result);
}
Reduction example with summation where the result of the reduction operation stores the dotproduct of two vectors
∑a[i]*b[i]
SRC : https://computing.llnl.gov/tutorials/openMP/
CSC 7600 Lecture 12 : OpenMPSpring 2011
Demo: Dot Product using Reduction
48
[LSU760000@n00 l12]$ ./reduction a[i] b[i] a[i]*b[i]0.000000 0.000000 0.0000001.000000 2.000000 2.0000002.000000 4.000000 8.0000003.000000 6.000000 18.0000004.000000 8.000000 32.0000005.000000 10.000000 50.0000006.000000 12.000000 72.0000007.000000 14.000000 98.0000008.000000 16.000000 128.0000009.000000 18.000000 162.00000010.000000 20.000000 200.00000011.000000 22.000000 242.00000012.000000 24.000000 288.00000013.000000 26.000000 338.00000014.000000 28.000000 392.00000015.000000 30.000000 450.000000Final result= 2480.000000
CSC 7600 Lecture 12 : OpenMPSpring 2011
Topics
• Review of HPC Models
• Shared Memory: Performance concepts
• Introduction to OpenMP
• OpenMP: Runtime Library & Environment Variables
• OpenMP: Data & Work sharing directives
• OpenMP: Synchronization
• OpenMP: Reduction
• Synopsis of Commands
• Summary Materials for Test
49
CSC 7600 Lecture 12 : OpenMPSpring 2011
Synopsis of Commands
• How to invoke OpenMP runtime systems #pragma omp parallel
• The interplay between OpenMP environment variables and
runtime system (omp_get_num_threads(),
omp_get_thread_num())
• Shared data directives such as shared, private and reduction
• Basic flow control using sections, for
• Fundamentals of synchronization using critical directive and
critical section.
• And directives used for the OpenMP programming part of the
problem set.
50
CSC 7600 Lecture 12 : OpenMPSpring 2011
Topics
• Review of HPC Models
• Shared Memory: Performance concepts
• Introduction to OpenMP
• OpenMP: Runtime Library & Environment Variables
• OpenMP: Data & Work sharing directives
• OpenMP: Synchronization
• OpenMP: Reduction
• Synopsis of Commands
• Summary Materials for Test
51
CSC 7600 Lecture 12 : OpenMPSpring 2011
Summary – Material for Test
• HPC Modalities – 4,5
• Performance issues in shared memory programming – 7,
8, 9, 10
• OpenMP runtime library routines – 16, 17
• OpenMP environment variables – 18, 19, 20
• OpenMP data environment 27, 28
• OpenMP work sharing directives – 29, 30, 31, 35, 36
• OpenMP thread synchronization – 40, 41, 42, 43
• OpenMP reduction 46
52
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &
MEANS
APPLIED PARALLEL ALGORITHMS 1
Prof. Thomas Sterling
Dr. Hartmut Kaiser
Department of Computer Science
Louisiana State University
March 10th, 2011
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Dr. Hartmut Kaiser
Center for Computation & Technology
R315 Johnston
2
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Puzzle of the Day
• What’s the difference between the following valid C
function declarations:
void foo();
void foo(void);
void foo(…);
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Puzzle of the Day
• What’s the difference between the following valid C
function declarations:

void foo();      any number of parameters
void foo(void);  no parameters
void foo(…);     any number of parameters

• What’s the difference between the following valid C++ function declarations:

void foo();
void foo(void);
void foo(…);
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Puzzle of the Day
• What’s the difference between the following valid C
function declarations:
void foo();      any number of parameters
void foo(void);  no parameters
void foo(…);     any number of parameters
• What’s the difference between the following valid C++ function declarations:
void foo();      no parameters
void foo(void);  no parameters
void foo(…);     any number of parameters
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
6
Topics
• Introduction
• Mandelbrot Sets
• Monte Carlo : PI Calculation
• Vector Dot-Product
• Matrix Multiplication
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
7
Topics
• Introduction
• Mandelbrot Sets
• Monte Carlo : PI Calculation
• Vector Dot-Product
• Matrix Multiplication
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
8
Parallel Programming
• Goals
– Correctness
– Reduction in execution time
– Efficiency
– Scalability
– Increased problem size and richness of models
• Objectives
– Expose parallelism
• Algorithm design
– Distribute work uniformly
• Data decomposition and allocation
• Dynamic load balancing
– Minimize overhead of synchronization and communication
• Coarse granularity
• Big messages
– Minimize redundant work
• Still sometimes better than communication
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
9
Basic Parallel (MPI) Program Steps
• Establish logical bindings
• Initialize application execution environment
• Distribute data and work
• Perform core computations in parallel (across nodes)
• Synchronize and Exchange intermediate data results– Optional for non-embarrassingly parallel (cooperative)
• Detect “stop” condition– Maybe implicit with a barrier etc.
• Aggregate final results– Often a reduction operator
• Output results and error code
• Terminate and return to OS
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
10
“embarrassingly parallel”
• Common phrase
– poorly defined,
– widely used
• Suggests lots and lots of parallelism
– with essentially no inter task communication or coordination
– Highly partitionable workload with minimal overhead
• “almost embarrassingly parallel”
– Same as above, but
– Requires master to launch many tasks
– Requires master to collect final results of tasks
– Sometimes still referred to as “embarrassingly parallel”
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
11
Topics
• Introduction
• Mandelbrot Sets
• Monte Carlo : PI Calculation
• Vector Dot-Product
• Matrix Multiplication
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Mandelbrot set
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B.
Wilkinson & M. Allen,
@ 2004 Pearson Education Inc. All rights reserved.
12
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson
& M. Allen,
@ 2004 Pearson Education Inc. All rights reserved.
Mandelbrot Set
Set of points in a complex plane that are quasi-stable (will
increase and decrease, but not exceed some limit) when
computed by iterating the function

  z_{k+1} = z_k^2 + c

where z_{k+1} is the (k + 1)th iteration of the complex number z =
(a + bi) and c is a complex number giving the position of the point in
the complex plane. The initial value for z is zero.

Iterations are continued until the magnitude of z is greater than 2 or the
number of iterations reaches an arbitrary limit. The magnitude of z
is the length of the vector given by

  length = sqrt(a^2 + b^2)
13
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,
@ 2004 Pearson Education Inc. All rights reserved.
Sequential routine computing value of
one point returning number of iterations
typedef struct complex {
  float real;
  float imag;
} complex;
int cal_pixel(complex c)
{
int count, max;
complex z;
float temp, lengthsq;
max = 256;
z.real = 0; z.imag = 0;
count = 0; /* number of iterations */
do {
temp = z.real * z.real - z.imag * z.imag + c.real;
z.imag = 2 * z.real * z.imag + c.imag;
z.real = temp;
lengthsq = z.real * z.real + z.imag * z.imag;
count++;
} while ((lengthsq < 4.0) && (count < max));
return count;
}
14
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Parallelizing Mandelbrot Set Computation
Static Task Assignment
Simply divide the region into fixed number of parts, each
computed by a separate processor.
Not very successful because different regions require
different numbers of iterations and time.
Dynamic Task Assignment
Have processor request regions after computing previous
regions
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,
@ 2004 Pearson Education Inc. All rights reserved.
15
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,
@ 2004 Pearson Education Inc. All rights reserved.
Dynamic Task Assignment: Work Pool / Processor Farms
16
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
17
Flowchart for Mandelbrot Set
Generation ("master" / "workers")

"master":
  Initialize MPI Environment
  Create local workload buffer
  Isolate work region
  Calculate Mandelbrot set values across work region
  Write result from task 0 to file
  Recv. results from "workers"
  Concatenate results to file
  End

each "worker":
  Initialize MPI Environment
  Create local workload buffer
  Isolate work region
  Calculate Mandelbrot set values across work region
  Send result to "master"
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
18
Mandelbrot Sets (source code)

#include<stdio.h>
#include<assert.h>
#include<stdlib.h>
#include<mpi.h>
typedef struct complex{
double real;
double imag;
} Complex;
int cal_pixel(Complex c){
int count, max_iter;
Complex z;
double temp, lengthsq;
max_iter = 256;
z.real = 0;
z.imag = 0;
count = 0;
do{
temp = z.real * z.real - z.imag * z.imag + c.real;
z.imag = 2 * z.real * z.imag + c.imag;
z.real = temp;
lengthsq = z.real * z.real + z.imag * z.imag;
count ++;
}
while ((lengthsq < 4.0) && (count < max_iter));
return(count);
} Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/
cal_pixel() runs on every worker process and calculates the iteration count of
z_{k+1} = z_k^2 + c for every pixel
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
19
Mandelbrot Sets (source code)

#define MASTERPE 0

int main(int argc, char **argv)
{
  FILE *file;
  int i, j;
  int tmp;
  Complex c;
  double *data_l, *data_l_tmp;
  int nx, ny;
  int mystrt, myend;
  int nrows_l;
  int nprocs, mype;
  MPI_Status status;

  /***** Initializing MPI Environment *****/
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &mype);

  /***** Pass in the dimension (X,Y) of the area to cover *****/
  if (argc != 3) {
    int err = 0;
    printf("argc %d\n", argc);
    if (mype == MASTERPE) {
      printf("usage: mandelbrot nx ny");
      MPI_Abort(MPI_COMM_WORLD, err);
    }
  }
  /* get command line args */
  nx = atoi(argv[1]);
  ny = atoi(argv[2]);
Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/
Initialize MPI Environment
Check if the input arguments : x,y dimensions of the region to be processed are passed
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
20
Mandelbrot Sets (source code)
  /* assume divides equally */
  nrows_l = nx/nprocs;
  mystrt = mype*nrows_l;
  myend = mystrt + nrows_l - 1;

  /* create buffer for local work only */
  data_l = (double *) malloc(nrows_l * ny * sizeof(double));
  data_l_tmp = data_l;

  /* calc each proc's coordinates and call local mandelbrot value generation function */
  for (i = mystrt; i <= myend; ++i) {
    c.real = i/((double) nx) * 4. - 2.;
    for (j = 0; j < ny; ++j) {
      c.imag = j/((double) ny) * 4. - 2.;
      tmp = cal_pixel(c);
      *data_l++ = (double) tmp;
    }
  }
  data_l = data_l_tmp;
Source :
http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/
Determining the dimensions of the work to be performed by each concurrent task.
Local tasks calculate the coordinates for each pixel in the local region.For each pixel, cal_pixel() function is called and the corresponding value is calculated
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
21
Mandelbrot Sets (source code)
  if (mype == MASTERPE) {
    file = fopen("mandelbrot.bin_0000", "w");
    printf("nrows_l, ny %d %d\n", nrows_l, ny);
    fwrite(data_l, nrows_l*ny, sizeof(double), file);
    fclose(file);
    for (i = 1; i < nprocs; ++i) {
      MPI_Recv(data_l, nrows_l * ny, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
      printf("received message from proc %d\n", i);
      file = fopen("mandelbrot.bin_0000", "a");
      fwrite(data_l, nrows_l*ny, sizeof(double), file);
      fclose(file);
    }
  }
  else {
    MPI_Send(data_l, nrows_l * ny, MPI_DOUBLE, MASTERPE, 0, MPI_COMM_WORLD);
  }

  MPI_Finalize();
}
Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/
Master process opens a file to store output into and stores its values in the file
Master then waits to receive values computed by each of the worker processes
Worker processes send computed mandelbrot values of their region to the master process
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
22
Demo : Mandelbrot Sets
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Demo: Mandelbrot Sets
23
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
24
Topics
• Introduction
• Mandelbrot Sets
• Monte Carlo : PI Calculation
• Vector Dot-Product
• Matrix Multiplication
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Monte Carlo Simulation
• Used when it is infeasible or impossible to compute
an exact result with a deterministic algorithm
• Especially useful in
– Studying systems with a large number of coupled degrees
of freedom
• Fluids, disordered materials, strongly coupled solids, cellular
structures
– For modeling phenomena with significant uncertainty in
inputs
• The calculation of risk in business
– These methods are also widely used in mathematics
• The evaluation of definite integrals, particularly multidimensional
integrals with complicated boundary conditions
26
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Monte Carlo Simulation
• No single approach, multitude of different methods
• Usually follows pattern
– Define a domain of possible inputs
– Generate inputs randomly from the domain
– Perform a deterministic computation using the inputs
– Aggregate the results of the individual computations into the final result
• Example: calculate Pi
27
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
28
Monte Carlo: Algorithm for Pi
• The value of PI can be calculated in a number of
ways. Consider the following method of
approximating PI: Inscribe a circle in a square
• Randomly generate points in the square
• Determine the number of points in the square that
are also in the circle
• Let r be the number of points in the circle divided
by the number of points in the square
• PI ~ 4 r
• Note that the more points generated, the better
the approximation
• Algorithm :
npoints = 10000
circle_count = 0
do j = 1,npoints
generate 2 random numbers between 0 and 1
xcoordinate = random1 ; ycoordinate = random2
if (xcoordinate, ycoordinate) inside circle
then circle_count = circle_count + 1
end do
PI = 4.0*circle_count/npoints
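A direct serial C translation of this pseudocode (a sketch, not from the original slides; rand() is used only for brevity):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int npoints = 10000;
    int circle_count = 0;

    for (int j = 0; j < npoints; j++) {
        /* two random coordinates in [0, 1] */
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x*x + y*y <= 1.0)        /* point falls inside the quarter circle */
            circle_count++;
    }
    printf("pi ~ %f\n", 4.0 * circle_count / npoints);
    return 0;
}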
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
30
OpenMP Pi Calculation
Master thread:
  Initialize variables
  Initialize OpenMP parallel environment
  Calculate PI
  Print value of pi

Master thread and N worker threads (in the parallel region):
  Generate random X, Y
  Calculate Z = X^2 + Y^2
  If the point lies within the circle (Z <= 1): Count++
  Reduction ∑ of the per-thread counts
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
OpenMP Calculating Pi
31
#include <omp.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#define SEED 42

main(int argc, char *argv[])
{
  int niter = 0;
  double x, y;
  int i, tid, count = 0;   /* # of points in the 1st quadrant of unit circle */
  double z;
  double pi;
  time_t rawtime;
  struct tm *timeinfo;

  printf("Enter the number of iterations used to estimate pi: ");
  scanf("%d", &niter);
  time(&rawtime);
  timeinfo = localtime(&rawtime);
Seed for generating random number
http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
OpenMP Calculating Pi
32
printf ( "The current date/time is: %s", asctime (timeinfo) );/* initialize random numbers */srand(SEED);
#pragma omp parallel for private(x,y,z,tid) reduction(+:count)for ( i=0; i<niter; i++) {
x = (double)rand()/RAND_MAX;y = (double)rand()/RAND_MAX;z = (x*x+y*y);if (z<=1) count++;if (i==(niter/6)-1) {
tid = omp_get_thread_num();printf(" thread %i just did iteration %i the count is %i\n",tid,i,count);
}if (i==(niter/3)-1) {
tid = omp_get_thread_num();printf(" thread %i just did iteration %i the count is %i\n",tid,i,count);
}if (i==(niter/2)-1) {
tid = omp_get_thread_num();printf(" thread %i just did iteration %i the count is %i\n",tid,i,count);
} http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML
Initialize random number generator; srand is used to seed the random number generated by rand()
Randomly generate x,y points
Initialize OpenMP parallel for with reduction(∑)
Calculate x^2+y^2 and check if it lies within the circle; if yes then increment count
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Calculating Pi
33
    if (i == (2*niter/3)-1) {
      tid = omp_get_thread_num();
      printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
    }
    if (i == (5*niter/6)-1) {
      tid = omp_get_thread_num();
      printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
    }
    if (i == niter-1) {
      tid = omp_get_thread_num();
      printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
    }
  }
  time(&rawtime);
  timeinfo = localtime(&rawtime);
  printf("The current date/time is: %s", asctime(timeinfo));
  printf(" the total count is %i\n", count);
  pi = (double)count/niter*4;
  printf("# of trials= %d , estimate of pi is %g \n", niter, pi);
  return 0;
}
http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML
Calculate PI based on the aggregate count of the points that lie within the circle
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Demo : OpenMP Pi
34
[cdekate@celeritas l13]$ ./omcpi
Enter the number of iterations used to estimate pi: 100000
The current date/time is: Tue Mar 4 05:53:52 2008
thread 0 just did iteration 16665 the count is 13124
thread 1 just did iteration 33332 the count is 6514
thread 1 just did iteration 49999 the count is 19609
thread 2 just did iteration 66665 the count is 13048
thread 3 just did iteration 83332 the count is 6445
thread 3 just did iteration 99999 the count is 19489
The current date/time is: Tue Mar 4 05:53:52 2008
the total count is 78320
# of trials= 100000 , estimate of pi is 3.1328
[cdekate@celeritas l13]$
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
35
Creating Custom Communicators
• Communicators define groups and the access patterns
among them
• Default communicator is MPI_COMM_WORLD
• Some algorithms demand more sophisticated control of
communications to take advantage of reduction
operators
• MPI permits creation of custom communicators
• MPI_Comm_create
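A minimal sketch (not from the original slides) of excluding one rank from a new communicator, mirroring what the Monte Carlo code below does with its server process; error handling is omitted.

#include <mpi.h>

/* Build a communicator that contains every rank except `excluded`. */
MPI_Comm make_workers(MPI_Comm world, int excluded) {
    MPI_Group world_group, worker_group;
    MPI_Comm workers;
    int ranks[1] = { excluded };

    MPI_Comm_group(world, &world_group);                   /* group of all ranks    */
    MPI_Group_excl(world_group, 1, ranks, &worker_group);  /* drop the one rank     */
    MPI_Comm_create(world, worker_group, &workers);        /* collective over world */
    MPI_Group_free(&worker_group);
    MPI_Group_free(&world_group);
    return workers;   /* MPI_COMM_NULL on the excluded rank */
}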
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
36
MPI Monte Carlo Pi Computation
Server:
  Initialize MPI environment
  Loop: Receive request → Compute random array → Send array to requestor
  On last request: Finalize MPI

Master:
  Initialize MPI environment
  Broadcast error bound
  Loop: Send request to server → Receive random array → Perform computations →
        Propagate number of points (Allreduce) → Output partial result →
        repeat until stop condition satisfied
  Print statistics, Finalize MPI

Worker:
  Initialize MPI environment
  Receive error bound
  Loop: Send request to server → Receive random array → Perform computations →
        Propagate number of points (Allreduce) → repeat until stop condition satisfied
  Finalize MPI
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
37
Monte Carlo : MPI - Pi (source code)

#include <stdio.h>
#include <math.h>
#include "mpi.h"
#define CHUNKSIZE 1000
#define INT_MAX 1000000000
#define REQUEST 1
#define REPLY 2

int main(int argc, char *argv[])
{
  int iter;
  int in, out, i, iters, max, ix, iy, ranks[1], done, temp;
  double x, y, Pi, error, epsilon;
  int numprocs, myid, server, totalin, totalout, workerid;
  int rands[CHUNKSIZE], request;
  MPI_Comm world, workers;
  MPI_Group world_group, worker_group;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  world = MPI_COMM_WORLD;
  MPI_Comm_size(world, &numprocs);
  MPI_Comm_rank(world, &myid);
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
38
Monte Carlo : MPI - Pi (source code)
  server = numprocs-1;   /* last proc is server */
  if (myid == 0)
    sscanf(argv[1], "%lf", &epsilon);

  MPI_Bcast(&epsilon, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  MPI_Comm_group(world, &world_group);
  ranks[0] = server;
  MPI_Group_excl(world_group, 1, ranks, &worker_group);
  MPI_Comm_create(world, worker_group, &workers);
  MPI_Group_free(&worker_group);

  if (myid == server) {
    do {
      MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, REQUEST, world, &status);
      if (request) {
        for (i = 0; i < CHUNKSIZE; ) {
          rands[i] = random();
          if (rands[i] <= INT_MAX) i++;
        }
        /* Send random number array */
        MPI_Send(rands, CHUNKSIZE, MPI_INT, status.MPI_SOURCE, REPLY, world);
      }
    } while (request > 0);
  }
  else { /* Begin Worker Block */
    request = 1;
    done = in = out = 0;
    max = INT_MAX;   /* max int, for normalization */
    MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
    MPI_Comm_rank(workers, &workerid);
    iter = 0;
Broadcast Error Bounds: epsilon
Create a custom communicator
Server process : 1. Receives request to generate a random ,2. Computes the random number array, 3. Send array to requestor
Worker process : Request the server to generate a random number array
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
39
Monte Carlo : MPI - Pi (source code)

    while (!done) {
      iter++;
      request = 1;
      /* Recv. random array from server */
      MPI_Recv(rands, CHUNKSIZE, MPI_INT, server, REPLY, world, &status);
      for (i = 0; i < CHUNKSIZE-1; ) {
        x = (((double) rands[i++])/max) * 2 - 1;
        y = (((double) rands[i++])/max) * 2 - 1;
        if (x*x + y*y < 1.0) in++;
        else out++;
      }

      MPI_Allreduce(&in, &totalin, 1, MPI_INT, MPI_SUM, workers);
      MPI_Allreduce(&out, &totalout, 1, MPI_INT, MPI_SUM, workers);
      Pi = (4.0*totalin)/(totalin + totalout);
      error = fabs(Pi - 3.141592653589793238462643);
      done = (error < epsilon || (totalin+totalout) > 1000000);
      request = (done) ? 0 : 1;
      if (myid == 0) {   /* If "Master" : Print current value of PI */
        printf("\rpi = %23.20f", Pi);
        MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
      }
      else {   /* If "Worker" : Request new array if not finished */
        if (request)
          MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
      }
    }
    MPI_Comm_free(&workers);
  }
Worker : Receive random number array from the Server
Worker: For each pair of x,y in the random number array, calculate the coordinates
Determine if the number is inside or out of the circle
Print current value of PI and request for more work
Compute the value of pi and Check if error is within threshhold
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
40
Monte Carlo : MPI - Pi (source code)
  if (myid == 0) {   /* If "Master" : Print Results */
    printf("\npoints: %d\nin: %d, out: %d, <ret> to exit\n",
           totalin+totalout, totalin, totalout);
    getchar();
  }
  MPI_Finalize();
}
Print the final value of PI
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
41
Demo : MPI Monte Carlo, Pi
> mpirun -np 4 monte 1e-20
pi = 3.14164517741129456496
points: 1000500
in: 785804, out: 214696
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
42
Topics
• Introduction
• Mandelbrot Sets
• Monte Carlo : PI Calculation
• Vector Dot-Product
• Matrix Multiplication
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Vector Dot Product
• Multiplication of 2 vectors followed by Summation
43
A[i] = (X1, X2, X3, X4, X5, …, Xn)
B[i] = (Y1, Y2, Y3, Y4, Y5, …, Yn)

A ∙ B = Σ (i = 1 to n) A[i] * B[i] = X1*Y1 + X2*Y2 + X3*Y3 + X4*Y4 + X5*Y5 + … + Xn*Yn
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
44
OpenMP Dot Product : using Reduction
Master thread:
  Initialize variables
  Initialize OpenMP parallel environment

Master thread and N worker threads (workload and schedule determined by OpenMP at runtime):
  Calculate local computations
  REDUCTION : ∑

Master thread:
  Print value of dot product
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
OpenMP Dot Product
45
#include <omp.h>
#include <stdio.h>

main () {
  int i, n, chunk;
  float a[16], b[16], result;
  n = 16;
  chunk = 4;
  result = 0.0;
  for (i = 0; i < n; i++) {
    a[i] = i * 1.0;
    b[i] = i * 2.0;
  }
  #pragma omp parallel for default(shared) private(i) \
      schedule(static,chunk) reduction(+:result)
  for (i = 0; i < n; i++)
    result = result + (a[i] * b[i]);
  printf("Final result= %f\n", result);
}
Reduction example with summation where the result of the reduction operation stores the dotproduct of two vectors
∑ a[i]*b[i]
SRC : https://computing.llnl.gov/tutorials/openMP/
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Demo: Dot Product using Reduction
46
[cdekate@celeritas l12]$ ./reductiona[i] b[i] a[i]*b[i]0.000000 0.000000 0.0000001.000000 2.000000 2.0000002.000000 4.000000 8.0000003.000000 6.000000 18.0000004.000000 8.000000 32.0000005.000000 10.000000 50.0000006.000000 12.000000 72.0000007.000000 14.000000 98.0000008.000000 16.000000 128.0000009.000000 18.000000 162.00000010.000000 20.000000 200.00000011.000000 22.000000 242.00000012.000000 24.000000 288.00000013.000000 26.000000 338.00000014.000000 28.000000 392.00000015.000000 30.000000 450.000000Final result= 2480.000000[cdekate@celeritas l12]$
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
47
MPI Dot Product Computation
Master:
  Initialize variables
  Initialize MPI environment
  Broadcast size of vectors
  Get Vector A & distribute partitioned Vector A
  Get Vector B & distribute partitioned Vector B
  Calculate dot-product for local workload
  REDUCTION ∑
  Print result

Each worker:
  Initialize variables
  Initialize MPI environment
  Receive size of vectors
  Receive local workload for Vector A
  Receive local workload for Vector B
  Calculate dot-product for local workload
  (participate in REDUCTION ∑)
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
MPI Dot Product
48
#include <stdio.h>
#include "mpi.h"
#define MAX_LOCAL_ORDER 100

main(int argc, char* argv[]) {
  float local_x[MAX_LOCAL_ORDER];
  float local_y[MAX_LOCAL_ORDER];
  int n;
  int n_bar;   /* = n/p */
  float dot;
  int p;
  int my_rank;
  void Read_vector(char* prompt, float local_v[], int n_bar, int p, int my_rank);
  float Parallel_dot(float local_x[], float local_y[], int n_bar);

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  if (my_rank == 0) {
    printf("Enter the order of the vectors\n");
    scanf("%d", &n);
  }

  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
Initialize MPI Environment
Broadcast the order of vectors across the workers
Parallel Programming with MPI
by
Peter Pacheco
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
MPI Dot Product
49
  n_bar = n/p;

  Read_vector("the first vector", local_x, n_bar, p, my_rank);
  Read_vector("the second vector", local_y, n_bar, p, my_rank);

  dot = Parallel_dot(local_x, local_y, n_bar);

  if (my_rank == 0)
    printf("The dot product is %f\n", dot);

  MPI_Finalize();
} /* main */

void Read_vector(
    char* prompt    /* in  */,
    float local_v[] /* out */,
    int n_bar       /* in  */,
    int p           /* in  */,
    int my_rank     /* in  */) {
  int i, q;
Receive and distribute the two vectors
Calculate the parallel dot product for local workloads
Master: Print the result of the dot product
Parallel Programming with MPI
by
Peter Pacheco
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
MPI Dot Product
50
  float temp[MAX_LOCAL_ORDER];
  MPI_Status status;

  if (my_rank == 0) {
    printf("Enter %s\n", prompt);
    for (i = 0; i < n_bar; i++)
      scanf("%f", &local_v[i]);
    for (q = 1; q < p; q++) {
      for (i = 0; i < n_bar; i++)
        scanf("%f", &temp[i]);
      MPI_Send(temp, n_bar, MPI_FLOAT, q, 0, MPI_COMM_WORLD);
    }
  } else {
    MPI_Recv(local_v, n_bar, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
  }
} /* Read_vector */
float Serial_dot(float x[] /* in */,
MASTER: Get the input from the User prepare the local workload
Get the input from the User load balance in real-time by storing the work chunks in arrayAnd sending the array to the worker nodes for processing
Worker : Receive the local workload to be processed
Serial_dot() : calculates the dot product on local arrays
Parallel Programming with MPI by
Peter Pacheco
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
MPI Dot Product
51
    float y[] /* in */,
    int n     /* in */) {
  int i;
  float sum = 0.0;
  for (i = 0; i < n; i++)
    sum = sum + x[i]*y[i];
  return sum;
} /* Serial_dot */

float Parallel_dot(
    float local_x[] /* in */,
    float local_y[] /* in */,
    int n_bar       /* in */) {
  float local_dot;
  float dot = 0.0;

  local_dot = Serial_dot(local_x, local_y, n_bar);
  MPI_Reduce(&local_dot, &dot, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
  return dot;
} /* Parallel_dot */
Serial_dot() : calculates the dot product on local arrays
Parallel_dot() : Calls the Serial_dot() to perform the dot product for local workload
Calculate the dotproduct and calculate summation using collective MPI_REDUCE calls (SUM)
Parallel Programming with MPI
by
Peter Pacheco
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Demo: MPI Dot Product
52
[cdekate@celeritas l13]$ mpirun …. ./mpi_dot
Enter the order of the vectors
16
Enter the first vector
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Enter the second vector
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
The dot product is 2480.000000
[cdekate@celeritas l13]$
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
53
Topics
• Introduction
• Mandelbrot Sets
• Monte Carlo : PI Calculation
• Vector Dot-Product
• Matrix Multiplication
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
54
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.
Matrix Vector Multiplication
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
55
Matrix-Vector Multiplication: c = A × b
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
56
Implementing Matrix Multiplication
Sequential Code
Assume throughout that the matrices are square (n x n matrices). The sequential code to compute A x B could simply be:

for (i = 0; i < n; i++)
  for (j = 0; j < n; j++) {
    c[i][j] = 0;
    for (k = 0; k < n; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
  }

This algorithm requires n^3 multiplications and n^3 additions, leading to a sequential time complexity of O(n^3). Very easy to parallelize.
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B.
Wilkinson & M. Allen,
@ 2004 Pearson Education Inc. All rights reserved.
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Implementing Matrix Multiplication
• With n processors (and n x n matrices), we can obtain:
  • Time complexity of O(n^2) with n processors
    – Each instance of the inner loop is independent and can be done by a separate processor
  • Time complexity of O(n) with n^2 processors
    – One element of A and B assigned to each processor
    – Cost optimal since O(n^3) = n x O(n^2) = n^2 x O(n)
  • Time complexity of O(log n) with n^3 processors
    – By parallelizing the inner loop
    – Not cost-optimal since O(n^3) < n^3 x O(log n)
• O(log n) is the lower bound for parallel matrix multiplication.
57
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
58
Block Matrix Multiplication
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B.
Wilkinson & M. Allen,
@ 2004 Pearson Education Inc. All rights reserved.
Partitioning into sub-matrices
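A minimal serial sketch (not from the original slides) of the block (sub-matrix) formulation, using a block size BS that is assumed to divide N evenly:

#define N  512
#define BS 64   /* block size; assumed to divide N evenly */

/* C += A * B computed block by block: each (ib, jb) block of C accumulates
 * the products of the corresponding block row of A and block column of B.
 * The caller is assumed to have zeroed C beforehand. */
void block_matmul(double A[N][N], double B[N][N], double C[N][N]) {
    for (int ib = 0; ib < N; ib += BS)
        for (int jb = 0; jb < N; jb += BS)
            for (int kb = 0; kb < N; kb += BS)
                /* multiply one BS x BS sub-matrix pair */
                for (int i = ib; i < ib + BS; i++)
                    for (int k = kb; k < kb + BS; k++)
                        for (int j = jb; j < jb + BS; j++)
                            C[i][j] += A[i][k] * B[k][j];
}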
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
59
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B.
Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.
Matrix Multiplication
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
60
Performance Improvement
Using tree construction n numbers can be added in O(log n) steps (using n3 processors):
Slides for Parallel Programming Techniques & Applications Using Networked
Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen, @
2004 Pearson Education Inc. All rights reserved.
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
61
OpenMP: Flowchart for Matrix Multiplication
Initialize variables & matrices
Initialize OpenMP Environment
Compute the Matrix product for the local workload
Print Results
Compute the Matrix product for the local workload
Compute the Matrix product for the local workload
Schedule and workload chunksize are determined based on user preferences
during compile/run time
Since each thread works on portion of the array and updates different parts of the same
array synchronization is not needed
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
OpenMP Matrix Multiplication
62
#include <stdio.h>
#include <omp.h>

/* Main Program */
main()
{
  int NoofRows_A, NoofCols_A, NoofRows_B, NoofCols_B, i, j, k;
  NoofRows_A = NoofCols_A = NoofRows_B = NoofCols_B = 4;
  float Matrix_A[NoofRows_A][NoofCols_A];
  float Matrix_B[NoofRows_B][NoofCols_B];
  float Result[NoofRows_A][NoofCols_B];

  /* Matrix_A Elements */
  for (i = 0; i < NoofRows_A; i++) {
    for (j = 0; j < NoofCols_A; j++)
      Matrix_A[i][j] = i + j;
  }
  /* Matrix_B Elements */
  for (i = 0; i < NoofRows_B; i++) {
    for (j = 0; j < NoofCols_B; j++)
      Matrix_B[i][j] = i + j;
  }
  printf("The Matrix_A Is \n");
Initialize the two Matrices A[][] & B[][] with sum of their index values
SRC : https://computing.llnl.gov/tutorials/openMP/
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
OpenMP Matrix Multiplication
63
  for (i = 0; i < NoofRows_A; i++) {
    for (j = 0; j < NoofCols_A; j++)
      printf("%f \t", Matrix_A[i][j]);
    printf("\n");
  }
  printf("The Matrix_B Is \n");
  for (i = 0; i < NoofRows_B; i++) {
    for (j = 0; j < NoofCols_B; j++)
      printf("%f \t", Matrix_B[i][j]);
    printf("\n");
  }
  for (i = 0; i < NoofRows_A; i++) {
    for (j = 0; j < NoofCols_B; j++) {
      Result[i][j] = 0.0;
    }
  }
#pragma omp parallel for private(j,k)
  for (i = 0; i < NoofRows_A; i = i + 1)
    for (j = 0; j < NoofCols_B; j = j + 1)
      for (k = 0; k < NoofCols_A; k = k + 1)
        Result[i][j] = Result[i][j] + Matrix_A[i][k] * Matrix_B[k][j];
  printf("\nThe Matrix Computation Result Is \n");
Initialize the results matrix with 0.0
Print the Matrices for debugging purposes
Using the OpenMP parallel for directive: calculate the product of the two matrices. Load balancing is done based on the values of OpenMP environment variables and the number of threads
SRC : https://computing.llnl.gov/tutorials/openMP/
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
OpenMP Matrix Multiplication
64
  for (i = 0; i < NoofRows_A; i = i + 1) {
    for (j = 0; j < NoofCols_B; j = j + 1)
      printf("%f ", Result[i][j]);
    printf("\n");
  }
}
SRC : https://computing.llnl.gov/tutorials/openMP/
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
DEMO : OpenMP Matrix Multiplication
65
[cdekate@celeritas l13]$ ./omp_mm
The Matrix_A Is
0.000000 1.000000 2.000000 3.000000
1.000000 2.000000 3.000000 4.000000
2.000000 3.000000 4.000000 5.000000
3.000000 4.000000 5.000000 6.000000
The Matrix_B Is
0.000000 1.000000 2.000000 3.000000
1.000000 2.000000 3.000000 4.000000
2.000000 3.000000 4.000000 5.000000
3.000000 4.000000 5.000000 6.000000

The Matrix Computation Result Is
14.000000 20.000000 26.000000 32.000000
20.000000 30.000000 40.000000 50.000000
26.000000 40.000000 54.000000 68.000000
32.000000 50.000000 68.000000 86.000000
[cdekate@celeritas l13]$
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
66
Flowchart for MPI Matrix Multiplication
“master” “workers”
Initialize MPI Environment
Initialize MPI Environment
Initialize MPI Environment
… Initialize MPI Environment
Initialize Array
Partition Array into workloads
Send Workload to “workers”
Recv. work Recv. work … Recv. work
wait for “workers“ to finish task
Calculate matrix product
Calculate matrix product
Calculate matrix product
…
Send result Send result … Send result
Recv. results
Print results
End
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
67
Matrix Multiplication (source code)

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define NRA 4                 /* number of rows in matrix A */
#define NCA 4                 /* number of columns in matrix A */
#define NCB 4                 /* number of columns in matrix B */
#define MASTER 0              /* taskid of first task */
#define FROM_MASTER 1         /* setting a message type */
#define FROM_WORKER 2         /* setting a message type */

int main(argc, argv)
int argc;
char *argv[];
{
int numtasks,                 /* number of tasks in partition */
    taskid,                   /* a task identifier */
    numworkers,               /* number of worker tasks */
    source,                   /* task id of message source */
    dest,                     /* task id of message destination */
    mtype,                    /* message type */
    rows,                     /* rows of matrix A sent to each worker */
    averow, extra, offset,    /* used to determine rows sent to each worker */
    i, j, k, rc;              /* misc */
double a[NRA][NCA],           /* matrix A to be multiplied */
       b[NCA][NCB],           /* matrix B to be multiplied */
       c[NRA][NCB];           /* result matrix C */
MPI_Status status;

MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
MPI_Comm_size(MPI_COMM_WORLD,&numtasks);
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
Initialize the MPI environment
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
68
Matrix Multiplication (source code)

if (numtasks < 2) {
   printf("Need at least two MPI tasks. Quitting...\n");
   MPI_Abort(MPI_COMM_WORLD, rc);
   exit(1);
}
numworkers = numtasks-1;

if (taskid == MASTER)
{
   for (i=0; i<NRA; i++)
      for (j=0; j<NCA; j++) {
         a[i][j] = i+j+1;
         b[i][j] = i+j+1;
      }
   printf("Matrix A :: \n");
   for (i=0; i<NRA; i++) {
      printf("\n");
      for (j=0; j<NCB; j++)
         printf("%6.2f ", a[i][j]);
   }
   printf("Matrix B :: \n");
   for (i=0; i<NRA; i++) {
      printf("\n");
      for (j=0; j<NCB; j++)
         printf("%6.2f ", b[i][j]);
   }
   averow = NRA/numworkers;
   extra = NRA%numworkers;
   offset = 0;
   mtype = FROM_MASTER;
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
MASTER: Initialize the matrix A & B
Print the two matrices for Debugging purposes
Calculate the number of rows to be processed by each worker
Calculate the number of overflow rows to be processed additionally by each worker
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
69
Matrix Multiplication (source code)

   for (dest=1; dest<=numworkers; dest++) {
      /* To each worker send: start point, number of rows to process, and sub-arrays to process */
      rows = (dest <= extra) ? averow+1 : averow;
      printf("Sending %d rows to task %d offset=%d\n", rows, dest, offset);
      MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
      MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
      MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
      MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
      offset = offset + rows;
   }

   /* Receive results from worker tasks */
   mtype = FROM_WORKER;    /* message tag for messages sent by "workers" */
   for (i=1; i<=numworkers; i++) {
      source = i;
      /* offset stores the (processing) starting point of the work chunk */
      MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
      printf("Received results from task %d\n", source);
   }

   printf("******************************************************\n");
   printf("Result Matrix:\n");
   for (i=0; i<NRA; i++) {
      printf("\n");
      for (j=0; j<NCB; j++)
         printf("%6.2f ", c[i][j]);
   }
   printf("\n******************************************************\n");
   printf("Done.\n");
}
MASTER : Send the workload chunk across to each of the worker
MASTER: Receive the workload chunk from the workers. c[][] contains the matrix products calculated for each workload chunk by the corresponding worker
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
70
Matrix Multiplication (source code)

/**************************** worker task ************************************/
if (taskid > MASTER)
{
   mtype = FROM_MASTER;
   MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
   MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
   MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
   MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

   for (k=0; k<NCB; k++)
      for (i=0; i<rows; i++) {
         c[i][k] = 0.0;
         for (j=0; j<NCA; j++)
            /* Calculate the product and store result in C */
            c[i][k] = c[i][k] + a[i][j] * b[j][k];
      }

   mtype = FROM_WORKER;
   MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
   MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
   /* Worker sends the resultant array to the master */
   MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
}
MPI_Finalize();
}
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
WORKER: Receive the workload to be processed by each worker
Calculate the matrix product and store the result in c[][]
Send the computed results array to the Master
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
71
Demo : Matrix Multiplication
[cdekate@celeritas matrix_multiplication]$ mpirun -np 4 -machinefile ~/hosts ./mpi_mm
mpi_mm has started with 4 tasks.
Initializing arrays...
Matrix A ::
1.00 2.00 3.00 4.00
2.00 3.00 4.00 5.00
3.00 4.00 5.00 6.00
4.00 5.00 6.00 7.00
Matrix B ::
1.00 2.00 3.00 4.00
2.00 3.00 4.00 5.00
3.00 4.00 5.00 6.00
4.00 5.00 6.00 7.00
Sending 2 rows to task 1 offset=0
Sending 1 rows to task 2 offset=2
Sending 1 rows to task 3 offset=3
Received results from task 1
Received results from task 2
Received results from task 3
Result Matrix:
30.00 40.00 50.00 60.00
40.00 54.00 68.00 82.00
50.00 68.00 86.00 104.00
60.00 82.00 104.00 126.00
[cdekate@celeritas matrix_multiplication]$
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
APPLIED PARALLEL ALGORITHMS 2
Prof. Thomas Sterling
Dr. Hartmut Kaiser
Department of Computer Science
Louisiana State University
March 18, 2011
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Puzzle of the Day
• Some nice ways to get something different from what
was intended:
2
if (a = 0) { … }
/* a always equals 0, but the block will never be executed */

if (0 < a < 5) { … }
/* this "boolean" is always true!  [think: (0 < a) < 5] */

if (a =! 0) { … }
/* a always equal to 1, as this is compiled as (a = !0), an assignment,
   rather than (a != 0) or (a == !0) */
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Topics
• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test
3
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Topics
• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test
4
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
5
Parallel Matrix Processing & Locality
• Maximize locality
  – Spatial locality
    • A variable is likely to be used if neighboring data has been used
    • Exploits unit or uniform stride access patterns (see the sketch after this list)
    • Exploits cache line length
    • Adjacent blocks minimize message traffic
      – Depends on volume to surface ratio
  – Temporal locality
    • A variable is likely to be reused if it was already recently used
    • Exploits cache loads and the LRU (least recently used) replacement policy
    • Exploits register allocation
  – Granularity
    • Maximizes length of local computation
    • Reduces number of messages
    • Maximizes length of individual messages
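As a small illustration (not from the slides) of spatial locality with unit-stride access: both functions below compute the same sum over a row-major C array, but the first walks memory contiguously while the second jumps N doubles between consecutive accesses and therefore reuses cache lines far less effectively.

#include <stddef.h>

#define N 1024

/* Row-wise traversal: consecutive iterations touch adjacent memory. */
double sum_rowwise(double a[N][N])
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-wise traversal: each access is N doubles away from the previous one,
   so a cache line is typically evicted before its other elements are used. */
double sum_colwise(double a[N][N])
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}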
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
6
Array Decomposition
• Simple MPI Example
• Master-Worker Data Partitioning and Distribution
  – Array decomposition
  – Uniformly distributes parts of the array among workers
    • (and master)
  – A kind of static load balancing
    • Assumes equal work on equal data set sizes
• Demonstrates
  – Data partitioning
  – Data distribution
  – Coarse grain parallel execution
    • No communication between tasks
  – Reduction operator
  – Master-worker control model
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
7
Array Decomposition Layout
• Dimensions
  – 1 dimension: linear (dot product)
  – 2 dimensions: "2-D" (matrix operations)
  – 3 dimensions (higher order models)
  – Impacts surface to volume ratio for inter-process communications
• Distribution
  – Block
    • Minimizes messaging
    • Maximizes message size
  – Cyclic
    • Improves load balancing
• Memory layout
  – C vs. FORTRAN
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
8
Array Decomposition
Accumulate sum from each part
Complete Array
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
9
Array Decomposition
Demonstrates simple data decomposition:
– The master initializes the array and then distributes an equal portion of the array among the other tasks.
– The other tasks receive their portion of the array and perform an addition operation on each array element.
– Each task maintains the sum for its portion of the array.
– The master task does likewise with its portion of the array.
– As each of the non-master tasks finishes, it sends its updated portion of the array to the master.
– An MPI collective communication call is used to collect the sums maintained by each task.
– Finally, the master task displays selected parts of the final array and the global sum of all array elements.
– Assumption: the array can be equally divided among the group.
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
10
Flowchart for Array Decomposition
“master” “workers”
Initialize MPI Environment
Initialize MPI Environment
Initialize MPI Environment
… Initialize MPI Environment
Initialize Array
Partition Array into workloads
Send Workload to “workers”
Recv. work Recv. work … Recv. work
Calculate Sum for array chunk
Calculate Sum for array chunk
Calculate Sum for array chunk
Calculate Sum for array chunk
…
Send Sum Send Sum … Send Sum
Recv. results
Reduction Operator to Sum up results
Print results
End
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
11
Array Decomposition (source code)

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define ARRAYSIZE 16000000
#define MASTER 0

float data[ARRAYSIZE];

int main (int argc, char **argv)
{
  int numtasks, taskid, rc, dest, offset, i, j, tag1,
      tag2, source, chunksize;
  float mysum, sum;
  float update(int myoffset, int chunk, int myid);
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  if (numtasks % 4 != 0) {
    printf("Quitting. Number of MPI tasks must be divisible by 4.\n");  /* for equal distribution of workload */
    MPI_Abort(MPI_COMM_WORLD, rc);
    exit(0);
  }
  MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
  printf("MPI task %d has started...\n", taskid);
  chunksize = (ARRAYSIZE / numtasks);
  tag2 = 1;
  tag1 = 2;
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c
Workload to be processed by each processor
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
12
Array Decomposition (source code)

  if (taskid == MASTER) {
    sum = 0;
    for (i=0; i<ARRAYSIZE; i++) {
      data[i] = i * 1.0;
      sum = sum + data[i];
    }
    printf("Initialized array sum = %e\n", sum);

    offset = chunksize;
    for (dest=1; dest<numtasks; dest++) {
      MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
      MPI_Send(&data[offset], chunksize, MPI_FLOAT, dest, tag2, MPI_COMM_WORLD);
      printf("Sent %d elements to task %d offset= %d\n", chunksize, dest, offset);
      offset = offset + chunksize;
    }

    offset = 0;
    mysum = update(offset, chunksize, taskid);

    for (i=1; i<numtasks; i++) {
      source = i;
      MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
      MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, MPI_COMM_WORLD, &status);
    }
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c
Initialize array
Array[0] -> Array[offset-1] is processed by master
Send workloads to respective processors
Master computes local sum
Master receives summation computed by workers
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
13
Array Decomposition (source code)

    MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);

    printf("Sample results: \n");
    offset = 0;
    for (i=0; i<numtasks; i++) {
      for (j=0; j<5; j++)
        printf(" %e", data[offset+j]);
      printf("\n");
      offset = offset + chunksize;
    }
    printf("*** Final sum= %e ***\n", sum);
  }  /* end of master section */

  if (taskid > MASTER) {
    /* Receive my portion of array from the master task */
    source = MASTER;
    MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
    MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, MPI_COMM_WORLD, &status);
    mysum = update(offset, chunksize, taskid);
    /* Send my results back to the master task */
    dest = MASTER;
    MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
    MPI_Send(&data[offset], chunksize, MPI_FLOAT, MASTER, tag2, MPI_COMM_WORLD);
    MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
  }  /* end of non-master */
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c
Master computes the SUM of all workloads
Worker processes receive work chunks from master
Each worker computes local sum
Send local sum to master process
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
14
Array Decomposition (source code)

  MPI_Finalize();
}  /* end of main */

float update(int myoffset, int chunk, int myid)
{
  int i;
  float mysum;
  /* Perform addition to each of my array elements and keep my sum */
  mysum = 0;
  for (i=myoffset; i < myoffset + chunk; i++) {
    data[i] = data[i] + i * 1.0;
    mysum = mysum + data[i];
  }
  printf("Task %d mysum = %e\n", myid, mysum);
  return(mysum);
}
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
15
Demo : Array Decomposition
[lsu00@master array_decomposition]$ mpiexec -np 4 ./array
MPI task 0 has started...
MPI task 2 has started...
MPI task 1 has started...
MPI task 3 has started...
Initialized array sum = 1.335708e+14
Sent 4000000 elements to task 1 offset= 4000000
Sent 4000000 elements to task 2 offset= 8000000
Task 1 mysum = 4.884048e+13
Sent 4000000 elements to task 3 offset= 12000000
Task 2 mysum = 7.983003e+13
Task 0 mysum = 1.598859e+13
Task 3 mysum = 1.161867e+14
Sample results:
0.000000e+00 2.000000e+00 4.000000e+00 6.000000e+00 8.000000e+00
8.000000e+06 8.000002e+06 8.000004e+06 8.000006e+06 8.000008e+06
1.600000e+07 1.600000e+07 1.600000e+07 1.600001e+07 1.600001e+07
2.400000e+07 2.400000e+07 2.400000e+07 2.400001e+07 2.400001e+07
*** Final sum= 2.608458e+14 ***
Output from arete for a 4 processor run.
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Topics
• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test
16
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Matrix Transpose
• The transpose of the (m x n) matrix A is the (n x m) matrix formed by interchanging the rows and columns, such that row i becomes column i of the transposed matrix
A is the m x n matrix with entries a_ij; its transpose A^T is the n x m matrix whose (i,j) entry is a_ji:

    [ a11 a12 ... a1n ]           [ a11 a21 ... am1 ]
A = [ a21 a22 ... a2n ]     A^T = [ a12 a22 ... am2 ]
    [ ...             ]           [ ...             ]
    [ am1 am2 ... amn ]           [ a1n a2n ... amn ]

Examples:

A = [ 1 3 4 ]        A^T = [ 1 0 ]
    [ 0 1 0 ]              [ 3 1 ]
                           [ 4 0 ]

A = [ 1 3 ]          A^T = [ 1 2 ]
    [ 2 5 ]                [ 3 5 ]
17
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Matrix Transpose - OpenMP
18
#include <stdio.h>
#include <sys/time.h>
#include <omp.h>
#define SIZE 4

main()
{
  int i, j;
  float Matrix[SIZE][SIZE], Trans[SIZE][SIZE];

  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      Matrix[i][j] = (i * j) * 5 + i;
  }
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      Trans[i][j] = 0.0;
  }
Initialize source matrix
Initialize results matrix
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Matrix Transpose - OpenMP
19
#pragma omp parallel for private(j)
  for (i = 0; i < SIZE; i++)
    for (j = 0; j < SIZE; j++)
      Trans[j][i] = Matrix[i][j];

  printf("The Input Matrix Is \n");
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      printf("%f \t", Matrix[i][j]);
    printf("\n");
  }
  printf("\nThe Transpose Matrix Is \n");
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      printf("%f \t", Trans[i][j]);
    printf("\n");
  }
  return 0;
}
Perform transpose in parallel using omp parallel for
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Matrix Transpose – OpenMP (DEMO)
20
[LSU760000@n01 matrix_transpose]$ ./omp_mtrans
The Input Matrix Is
0.000000   0.000000   0.000000   0.000000
1.000000   6.000000   11.000000  16.000000
2.000000   12.000000  22.000000  32.000000
3.000000   18.000000  33.000000  48.000000

The Transpose Matrix Is
0.000000   1.000000   2.000000   3.000000
0.000000   6.000000   12.000000  18.000000
0.000000   11.000000  22.000000  33.000000
0.000000   16.000000  32.000000  48.000000
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Matrix Transpose - MPI
21
#include <stdio.h>
#include "mpi.h"
#define N 4

int A[N][N];

void fill_matrix()
{
  int i, j;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      A[i][j] = i * N + j;
}

void print_matrix()
{
  int i, j;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++)
      printf("%d ", A[i][j]);
    printf("\n");
  }
}
Initialize source matrix
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Matrix Transpose - MPI
22
main(int argc, char* argv[])
{
  int r, i;
  MPI_Status st;
  MPI_Datatype typ;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &r);

  if (r == 0) {
    fill_matrix();
    printf("\n Source:\n");
    print_matrix();
    MPI_Type_contiguous(N * N, MPI_INT, &typ);
    MPI_Type_commit(&typ);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Send(&(A[0][0]), 1, typ, 1, 0, MPI_COMM_WORLD);
  }
Creating a custom MPI datatype to store local workloads
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Matrix Transpose - MPI
23
  else if (r == 1) {
    MPI_Type_vector(N, 1, N, MPI_INT, &typ);
    MPI_Type_hvector(N, 1, sizeof(int), typ, &typ);
    MPI_Type_commit(&typ);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Recv(&(A[0][0]), 1, typ, 0, 0, MPI_COMM_WORLD, &st);
    printf("\n Transposed:\n");
    print_matrix();
  }

  MPI_Finalize();
}
Creates a vector datatype of N blocks of length 1, with a stride of N elements between blocks
The MPI_Type_hvector datatype allows an on-the-fly transpose of the matrix
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Matrix Transpose – MPI (DEMO)
24
[LSU760000@n01 matrix_transpose]$ mpiexec -np 2 ./mpi_mtrans
Source:
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15

Transposed:
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Topics
• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test
25
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Linear Systems
a11·x1 + a12·x2 + a13·x3 = b1
a21·x1 + a22·x2 + a23·x3 = b2
a31·x1 + a32·x2 + a33·x3 = b3

In matrix form:

[ a11 a12 a13 ] [ x1 ]   [ b1 ]
[ a21 a22 a23 ] [ x2 ] = [ b2 ]
[ a31 a32 a33 ] [ x3 ]   [ b3 ]

Solve Ax = b, where A is an n × n matrix and b is an n × 1 column vector
www.cs.princeton.edu/courses/archive/fall07/cos323/
26
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Gauss-Jordan Elimination
• Fundamental operations:
  1. Replace one equation with a linear combination of other equations
  2. Interchange two equations
  3. Re-label two variables
• Combine these to reduce the system to a trivial one
• The simplest variant uses only operation #1, but one gets better stability by also adding
  – #2, or
  – #2 and #3
www.cs.princeton.edu/courses/archive/fall07/cos323/
27
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Gauss-Jordan Elimination
• Solve:
• Can be represented as
• Goal: reduce the LHS to an identity matrix, leaving the solutions in the RHS
2·x1 + 3·x2 = 7
4·x1 + 5·x2 = 13

[ 2  3 ] [ x1 ]   [ 7  ]
[ 4  5 ] [ x2 ] = [ 13 ]

[ 1  0 ] [ x1 ]   [ ? ]
[ 0  1 ] [ x2 ] = [ ? ]
www.cs.princeton.edu/courses/archive/fall07/cos323/
28
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Gauss-Jordan Elimination
• Basic operation 1: replace any row by
linear combination with any other row :
replace row1 with 1/2 * row1 + 0 * row2
• Replace row2 with row2 – 4 * row1
• Negate row2
[ 2  3 |  7 ]      [ 1  3/2 | 7/2 ]      [ 1  3/2 | 7/2 ]      [ 1  3/2 | 7/2 ]
[ 4  5 | 13 ]  →   [ 4   5  |  13 ]  →   [ 0  -1  | -1  ]  →   [ 0   1  |  1  ]
www.cs.princeton.edu/courses/archive/fall07/cos323/
29
Row1 = (Row1)/2
Row2=Row2-(4*Row1)
Row2 = (-1)*Row2
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Gauss-Jordan Elimination
• Replace row1 with row1 – 3/2 * row2
• Solution:
x1 = 2, x2 = 1
[ 1  3/2 | 7/2 ]      [ 1  0 | 2 ]
[ 0   1  |  1  ]  →   [ 0  1 | 1 ]
www.cs.princeton.edu/courses/archive/fall07/cos323/
30
Row1 = Row1 – (3/2)* Row2
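A rough C sketch (not from the slides) of the procedure just shown: Gauss-Jordan elimination without pivoting on the augmented matrix [A | b], reducing the left-hand side to the identity so that the solution ends up in the last column. The test system is the 2 x 2 example from the slides above.

#include <stdio.h>

#define N 2

/* Reduce the augmented matrix [A | b] to [I | x]; assumes no zero pivots. */
void gauss_jordan(double a[N][N + 1])
{
    for (int i = 0; i < N; i++) {
        double piv = a[i][i];
        for (int j = i; j <= N; j++)
            a[i][j] /= piv;                /* scale row i so a[i][i] becomes 1 */
        for (int r = 0; r < N; r++) {
            if (r == i) continue;
            double f = a[r][i];
            for (int j = i; j <= N; j++)
                a[r][j] -= f * a[i][j];    /* zero out column i in every other row */
        }
    }
}

int main(void)
{
    /* 2*x1 + 3*x2 = 7, 4*x1 + 5*x2 = 13  ->  x1 = 2, x2 = 1 */
    double aug[N][N + 1] = { {2, 3, 7}, {4, 5, 13} };
    gauss_jordan(aug);
    printf("x1 = %g, x2 = %g\n", aug[0][N], aug[1][N]);
    return 0;
}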
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Pivoting
• Consider this system:
• Immediately run into problem: algorithm wants us to divide by zero!
• More subtle version:
• The pivot or pivot element is the element of a matrix which is
selected first by an algorithm to do computation
• Pivot entry is usually required to be at least distinct from zero, and
often distant from it
• Select the largest element in the matrix and swap columns and rows to bring this element to the 'right' position: full (complete) pivoting
[ 0  1 ] [ x1 ]   [ 2 ]
[ 2  3 ] [ x2 ] = [ 8 ]

[ 0.001  1 ] [ x1 ]   [ 2 ]
[   2    3 ] [ x2 ] = [ 8 ]
www.cs.princeton.edu/courses/archive/fall07/cos323/
31
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Pivoting
• Consider this system:

  [ 0  1 ] [ x1 ]   [ 1 ]
  [ 3  2 ] [ x2 ] = [ 8 ]

• Pivoting:
  – Swap rows 1 and 2:

  [ 3  2 ] [ x1 ]   [ 8 ]
  [ 0  1 ] [ x2 ] = [ 1 ]

  – And continue to solve as shown before:

  [ 1  2/3 ] [ x1 ]   [ 8/3 ]
  [ 0   1  ] [ x2 ] = [  1  ]

  [ 1  0 ] [ x1 ]   [ 2 ]
  [ 0  1 ] [ x2 ] = [ 1 ]

  x1 = 2, x2 = 1

www.cs.princeton.edu/courses/archive/fall07/cos323/

32
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Pivoting: Example
• Division by small numbers → round-off error in computer arithmetic
• Consider the following system:
  0.0001 x1 + x2 = 1.000
       x1 + x2 = 2.000
• Exact solution: x1 = 1.0001 and x2 = 0.9999
• Say we round off after 3 digits after the decimal point
• Multiply the first equation by 10^4 and subtract it from the second equation:
  (1 - 1) x1 + (1 - 10^4) x2 = 2 - 10^4
• But, in finite precision with only 3 digits:
  – 1 - 10^4 = -0.9999E+4 ≈ -0.999E+4
  – 2 - 10^4 = -0.9998E+4 ≈ -0.999E+4
• Therefore, x2 = 1 and x1 = 0 (from the first equation)
• Very far from the real solution!
[ 0.0001  1 ] [ x1 ]   [ 1 ]
[   1     1 ] [ x2 ] = [ 2 ]

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

33
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Partial Pivoting
• Partial pivoting doesn't look for the largest element in the matrix, but just for the largest element in the 'current' column
• Swap rows to bring the corresponding row to the 'right' position
• Partial pivoting is generally sufficient to adequately
reduce round-off error.
• Complete pivoting is usually not necessary to ensure
numerical stability
• Due to the additional computations it introduces, it may
not always be the most appropriate pivoting strategy
34
http://www.amath.washington.edu/~bloss/amath352_lectures/
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Partial Pivoting
• One can just swap rows:
x1 + x2 = 2.000
0.0001x1 + x2 = 1.000
• Multiplying the first equation by 0.0001 and subtracting it from the second equation gives:
(1 - 0.0001)x2 = 1 - 0.0001
0.9999 x2 = 0.9999 => x2 = 1
and then x1 = 1
• Final solution is closer to the real solution.
• Partial Pivoting
– For numerical stability, one doesn't go in order, but picks the next row among rows i to n that has the largest element in column i
– This row is swapped with row i (along with elements of the right hand side) before the subtractions
• the swap is not done in memory but rather one keeps an indirection array
• Total Pivoting
– Look for the greatest element ANYWHERE in the matrix
– Swap columns
– Swap rows
• Numerical stability is really a difficult field
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
35
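A minimal sketch (not from the slides) of Gaussian elimination with partial pivoting: before eliminating column i, the row with the largest magnitude entry in that column is swapped into the pivot position. For simplicity the rows are swapped in memory here, rather than through the indirection array mentioned above.

#include <stdio.h>
#include <math.h>

#define N 2

/* Eliminate below the diagonal of the augmented matrix a[][N+1],
   choosing the largest pivot in each column (partial pivoting). */
void eliminate_partial_pivot(double a[N][N + 1])
{
    for (int i = 0; i < N - 1; i++) {
        int piv = i;
        for (int r = i + 1; r < N; r++)          /* largest |entry| in column i */
            if (fabs(a[r][i]) > fabs(a[piv][i]))
                piv = r;
        if (piv != i)                            /* swap the pivot row into place */
            for (int k = 0; k <= N; k++) {
                double tmp = a[i][k]; a[i][k] = a[piv][k]; a[piv][k] = tmp;
            }
        for (int j = i + 1; j < N; j++) {        /* subtract multiples of the pivot row */
            double f = a[j][i] / a[i][i];
            for (int k = i; k <= N; k++)
                a[j][k] -= f * a[i][k];
        }
    }
}

int main(void)
{
    /* The example above: 0.0001*x1 + x2 = 1, x1 + x2 = 2. */
    double a[N][N + 1] = { {0.0001, 1, 1}, {1, 1, 2} };
    eliminate_partial_pivot(a);
    double x2 = a[1][2] / a[1][1];
    double x1 = (a[0][2] - a[0][1] * x2) / a[0][0];
    printf("x1 = %g, x2 = %g\n", x1, x2);        /* close to the exact solution */
    return 0;
}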
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Partial Pivoting
36
http://www.amath.washington.edu/~bloss/amath352_lectures/
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Special Cases
• Common special case:
• Tri-diagonal Systems :
– Only the main diagonal & 1 band above, 1 below
– Solve using : Gauss-Jordan
• Lower Triangular Systems (L)
– Solve using : forward substitution
• Upper Triangular Systems (U)
– Solve using : backward substitution
Tri-diagonal system (only the main diagonal plus one band above and one below):

[ a11 a12  0   0  ] [ x1 ]   [ b1 ]
[ a21 a22 a23  0  ] [ x2 ] = [ b2 ]
[  0  a32 a33 a34 ] [ x3 ]   [ b3 ]
[  0   0  a43 a44 ] [ x4 ]   [ b4 ]

Lower triangular system (L), solved by forward substitution:

[ a11  0   0   0  ] [ x1 ]   [ b1 ]
[ a21 a22  0   0  ] [ x2 ] = [ b2 ]
[ a31 a32 a33  0  ] [ x3 ]   [ b3 ]
[ a41 a42 a43 a44 ] [ x4 ]   [ b4 ]

x1 = b1 / a11
x2 = (b2 - a21·x1) / a22
x3 = (b3 - a31·x1 - a32·x2) / a33
...

Upper triangular system (U), solved by backward substitution:

[ a11 a12 a13 a14 a15 ] [ x1 ]   [ b1 ]
[  0  a22 a23 a24 a25 ] [ x2 ]   [ b2 ]
[  0   0  a33 a34 a35 ] [ x3 ] = [ b3 ]
[  0   0   0  a44 a45 ] [ x4 ]   [ b4 ]
[  0   0   0   0  a55 ] [ x5 ]   [ b5 ]

x5 = b5 / a55
x4 = (b4 - a45·x5) / a44
...
www.cs.princeton.edu/courses/archive/fall07/cos323/
37
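A minimal C sketch (not from the slides) of the two triangular solves named above: forward substitution for a lower triangular system L x = b and backward substitution for an upper triangular system U x = b.

#include <stdio.h>

#define N 4

/* Forward substitution: x1 is known immediately, each later xi uses the
   already-computed unknowns to its left. */
void forward_subst(double L[N][N], double b[N], double x[N])
{
    for (int i = 0; i < N; i++) {
        double s = b[i];
        for (int j = 0; j < i; j++)
            s -= L[i][j] * x[j];
        x[i] = s / L[i][i];
    }
}

/* Backward substitution: start from the last equation and work upwards. */
void backward_subst(double U[N][N], double b[N], double x[N])
{
    for (int i = N - 1; i >= 0; i--) {
        double s = b[i];
        for (int j = i + 1; j < N; j++)
            s -= U[i][j] * x[j];
        x[i] = s / U[i][i];
    }
}

int main(void)
{
    double L[N][N] = { {2,0,0,0}, {1,1,0,0}, {0,2,1,0}, {0,0,1,2} };
    double b[N]    = {2, 3, 4, 6};
    double x[N];
    forward_subst(L, b, x);
    printf("%g %g %g %g\n", x[0], x[1], x[2], x[3]);   /* 1 2 0 3 */
    return 0;
}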
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Topics
• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test
38
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Solving Linear Systems of Eq.
• Method for solving Linear Systems
– The need to solve linear systems arises in an estimated 75% of all scientific computing problems [Dahlquist 1974]
• Gaussian Elimination is perhaps the most well-known method
– based on the fact that the solution of a linear system is invariant under scaling and under row additions
• One can multiply a row of the matrix by a constant as long as one multiplies the corresponding element of the right-hand side by the same constant
• One can add a row of the matrix to another one as long as one adds the corresponding elements of the right-hand side
– Idea: scale and add equations so as to transform matrix A into an upper triangular matrix:

(figure: the transformed upper triangular system, where equation n-i has i unknowns and the right-hand side entries are shown as ?)
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
39
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Gaussian Elimination

[ 1  1  1 ]       [ 0 ]
[ 1 -2  2 ]  x  = [ 4 ]
[ 1  2 -1 ]       [ 2 ]

Subtract row 1 from rows 2 and 3:

[ 1  1  1 ]       [ 0 ]
[ 0 -3  1 ]  x  = [ 4 ]
[ 0  1 -2 ]       [ 2 ]

Multiply row 3 by 3 and add row 2:

[ 1  1  1 ]       [ 0  ]
[ 0 -3  1 ]  x  = [ 4  ]
[ 0  0 -5 ]       [ 10 ]

Solving the equations in reverse order (backsolving):

-5x3 = 10            =>  x3 = -2
-3x2 + x3 = 4        =>  x2 = -2
x1 + x2 + x3 = 0     =>  x1 = 4
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
40
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Gaussian Elimination
• The algorithm goes through the matrix from the top-left
corner to the bottom-right corner
• The ith step eliminates non-zero sub-diagonal elements
in column i, subtracting the ith row scaled by aji/aii from
row j, for j=i+1,..,n.
(figure: at step i, the values in the first i rows and columns are already computed; pivot row i is used to zero the elements below the diagonal in column i; the values in the trailing submatrix are yet to be updated)
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
41
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Sequential Gaussian Elimination
Simple sequential algorithm
// for each column i
// zero it out below the diagonal by adding
// multiples of row i to later rows
for i = 1 to n-1
    // for each row j below row i
    for j = i+1 to n
        // add a multiple of row i to row j
        for k = i to n
            A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
• Several “tricks” that do not change the spirit of the algorithm but
make implementation easier and/or more efficient
– Right-hand side is typically kept in column n+1 of the matrix and one speaks of an augmented matrix
– Compute the A(j,i)/A(i,i) term outside of the inner loop
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
42
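A direct C rendering (a sketch, not from the slides) of the pseudocode above, applied to the augmented matrix of the worked example from the previous slide; the A(j,i)/A(i,i) factor is computed once per row j, outside the k loop, as the second "trick" suggests.

#include <stdio.h>

#define N 3

/* Zero out everything below the diagonal of the augmented matrix a[][N+1]. */
void gaussian_eliminate(double a[N][N + 1])
{
    for (int i = 0; i < N - 1; i++) {            /* for each column i            */
        for (int j = i + 1; j < N; j++) {        /* for each row j below row i   */
            double factor = a[j][i] / a[i][i];   /* hoisted out of the k loop    */
            for (int k = i; k <= N; k++)
                a[j][k] -= factor * a[i][k];     /* add a multiple of row i to j */
        }
    }
}

int main(void)
{
    /* The earlier example; backsolving this gives x = (4, -2, -2). */
    double a[N][N + 1] = { {1, 1, 1, 0}, {1, -2, 2, 4}, {1, 2, -1, 2} };
    gaussian_eliminate(a);
    for (int i = 0; i < N; i++)
        printf("%6.2f %6.2f %6.2f | %6.2f\n", a[i][0], a[i][1], a[i][2], a[i][3]);
    return 0;
}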
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Parallel Gaussian Elimination?
• Assume that we have one processor per matrix element
Reduction
to find the max aji
Broadcast
max aji needed to compute
the scaling factor
Compute
Independent computation
of the scaling factor
Broadcasts
Every update needs the
scaling factor and the
element from the pivot
row
Compute
Independent
computations
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
43
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
LU Factorization
• Gaussian Elimination is simple but
  – What if we have to solve many Ax = b systems for different values of b?
    • This happens a LOT in real applications
• Another method is the "LU Factorization" (LU Decomposition)
• Ax = b
• Say we could rewrite A = L U, where L is a lower triangular matrix, and U is an upper triangular matrix    O(n³)
• Then Ax = b is written L U x = b
• Solve L y = b    O(n²)
• Solve U x = y    O(n²)
(figure: L y = b, a lower triangular system in which equation i has i unknowns, and U x = y, an upper triangular system in which equation n-i has i unknowns; triangular system solves are easy)
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
44
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
LU Factorization: Principle
• It works just like the Gaussian Elimination, but instead of zeroing out elements, one “saves” scaling coefficients.
• Magically, A = L x U !
• Should be done with pivoting as well
A = [ 1  2 -1 ]
    [ 4  3  1 ]
    [ 2  2  3 ]

Gaussian elimination on row 2 (row2 = row2 - 4·row1), then save the scaling factor 4 in the zeroed position:

[ 1  2 -1 ]        [ 1  2 -1 ]
[ 0 -5  5 ]   →    [ 4 -5  5 ]
[ 2  2  3 ]        [ 2  2  3 ]

Gaussian elimination on row 3 (row3 = row3 - 2·row1) + save the scaling factor 2:

[ 1  2 -1 ]
[ 4 -5  5 ]
[ 2 -2  5 ]

Gaussian elimination on column 2 (row3 = row3 - (2/5)·row2) + save the scaling factor 2/5:

[ 1  2  -1 ]
[ 4 -5   5 ]
[ 2 2/5  3 ]

    [ 1   0   0 ]        [ 1  2 -1 ]
L = [ 4   1   0 ]    U = [ 0 -5  5 ]
    [ 2  2/5  1 ]        [ 0  0  3 ]
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
45
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
LU Factorization
(figure: the scaling factors are stored in column k below the diagonal)

LU-sequential(A,n) {
  for k = 0 to n-2 {
    // preparing column k
    for i = k+1 to n-1
      aik ← -aik / akk
    for j = k+1 to n-1
      // Task Tkj: update of column j
      for i = k+1 to n-1
        aij ← aij + aik * akj
  }
}
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
• We’re going to look at the simplest possible version
– No pivoting: just creates a bunch of indirections that are easy but make
the code look complicated without changing the overall principle
46
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
LU Factorization
• We’re going to look at the simplest possible version
– No pivoting: just creates a bunch of indirections that are easy but make
the code look complicated without changing the overall principle
LU-sequential(A,n) {
  for k = 0 to n-2 {
    // preparing column k
    for i = k+1 to n-1
      aik ← -aik / akk
    for j = k+1 to n-1
      // Task Tkj: update of column j
      for i = k+1 to n-1
        aij ← aij + aik * akj
  }
}

(figure: at step k, element a[i,j] in the trailing submatrix is updated using a[i,k] from column k and a[k,j] from row k)
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
47
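A small C sketch (not from the slides) that mirrors the LU-sequential pseudocode above, applied in place to the 3 x 3 matrix from the principle slide. Note that this version of the pseudocode stores the negated multipliers below the diagonal, so the printed sub-diagonal entries are the negatives of the L factors shown earlier (-4, -2, -2/5); the diagonal and the strictly upper part give U.

#include <stdio.h>

#define N 3

/* In-place LU factorization without pivoting, following the pseudocode:
   for each column k, prepare the (negated) scaling factors, then update
   the trailing columns. */
void lu_sequential(double a[N][N])
{
    for (int k = 0; k <= N - 2; k++) {
        for (int i = k + 1; i < N; i++)
            a[i][k] = -a[i][k] / a[k][k];            /* preparing column k        */
        for (int j = k + 1; j < N; j++)              /* task Tkj: update column j */
            for (int i = k + 1; i < N; i++)
                a[i][j] = a[i][j] + a[i][k] * a[k][j];
    }
}

int main(void)
{
    double a[N][N] = { {1, 2, -1}, {4, 3, 1}, {2, 2, 3} };
    lu_sequential(a);
    for (int i = 0; i < N; i++)
        printf("%8.3f %8.3f %8.3f\n", a[i][0], a[i][1], a[i][2]);
    return 0;
}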
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Parallel LU on a ring
• Since the algorithm operates by columns from left to right, we should
distribute columns to processors
• Principle of the algorithm
– At each step, the processor that owns column k does the “prepare” task
and then broadcasts the bottom part of column k to all others
• Annoying if the matrix is stored in row-major fashion
• Remember that one is free to store the matrix in anyway one wants, as long
as it’s coherent and that the right output is generated
– After the broadcast, the other processors can then update their data.
• Assume there is a function alloc(k) that returns the rank of the
processor that owns column k
– Basically so that we don’t clutter our program with too many global-to-
local index translations
• In fact, we will first write everything in terms of global indices, as to
avoid all annoying index arithmetic
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
48
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
LU-broadcast algorithm
LU-broadcast(A,n) {
  q ← MY_NUM()
  p ← NUM_PROCS()
  for k = 0 to n-2 {
    if (alloc(k) == q)
      // preparing column k
      for i = k+1 to n-1
        buffer[i-k-1] ← aik ← -aik / akk
    broadcast(alloc(k),buffer,n-k-1)
    for j = k+1 to n-1
      if (alloc(j) == q)
        // update of column j
        for i = k+1 to n-1
          aij ← aij + buffer[i-k-1] * akj
  }
}
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
49
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Dealing with local indices
• Assume that p divides n
• Each processor needs to store r=n/p columns and its
local indices go from 0 to r-1
• After step k, only columns with indices greater than k will
be used
• Simple idea: use a local index, l, that everyone initializes
to 0
• At step k, processor alloc(k) increases its local index so
that next time it will point to its next local column
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
50
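A tiny illustration (not from the slides; alloc and local_index are hypothetical helpers) of what this mapping works out to for a cyclic column distribution: column k lives on processor k mod p and is that processor's (k / p)-th local column, which is exactly the value the running local index l has reached whenever the processor owns column k.

#include <stdio.h>

int alloc(int k, int p)       { return k % p; }   /* owner of global column k */
int local_index(int k, int p) { return k / p; }   /* its index on that owner  */

int main(void)
{
    int n = 8, p = 4;
    for (int k = 0; k < n; k++)
        printf("global column %d -> processor %d, local column %d\n",
               k, alloc(k, p), local_index(k, p));
    return 0;
}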
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
LU-broadcast algorithm
...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (alloc(k) == q)
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,k] ← -a[i,l] / a[k,l]
    l ← l+1
  broadcast(alloc(k),buffer,n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
}
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
51
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Bad load balancing
P1 P2 P3 P4
already
done
already
done working
on it
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
52
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Good Load Balancing?
(figure: the already-done columns and the column being worked on are spread across all processors)
Cyclic distribution
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
53
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Load-balanced program
...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (k mod p == q)
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,k] ← -a[i,l] / a[k,l]
    l ← l+1
  broadcast(alloc(k),buffer,n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
}
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
54
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Performance Analysis
• How long does this code take to run?
  – This is not an easy question because there are many tasks and many communications
• A little bit of analysis shows that the execution time is the sum of three terms:
  – n-1 communications:        n·L + (n²/2)·b + O(1)
  – n-1 column preparations:   (n²/2)·w' + O(1)
  – column updates:            (n³/3p)·w + O(n²)
• Therefore, the execution time is O(n³/p)
  – Note that the sequential time is: O(n³)
• Therefore, we have perfect asymptotic efficiency!
  – This is good, but isn't always the best in practice
• How can we improve this algorithm?
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
55
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Pipelining on the Ring
• So far, in the algorithm we’ve used a simple broadcast
• Nothing was specific to being on a ring of processors and it's portable
  – in fact you could just write raw MPI that looks like our pseudo-code and have a very limited LU factorization (inefficient for small n) that works only for some numbers of processors
• But it's not efficient
  – The n-1 communication steps are not overlapped with computations
  – Therefore Amdahl's law, etc.
• Turns out that on a ring, with a cyclic distribution of the columns, one can interleave pieces of the broadcast with the computation
  – It almost looks like inserting the source code from the broadcast code we saw at the very beginning throughout the LU code
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
56
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Previous program
...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (k == q mod p)
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,k] ← -a[i,l] / a[k,l]
    l ← l+1
  broadcast(alloc(k),buffer,n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
}
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
57
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
LU-pipeline algorithm
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (k == q mod p)
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,k] ← -a[i,l] / a[k,l]
    l ← l+1
    send(buffer,n-k-1)
  else
    recv(buffer,n-k-1)
    if (q ≠ k-1 mod p) send(buffer, n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
}
src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
58
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Topics
• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test
59
CSC 7600 Lecture 16:Applied Parallel Algorithms 2 Spring 2011
Summary : Material for the Test
• Matrix Transpose: Slides 17-23
• Gauss-Jordan: Slides 26-30
• Pivoting: Slides 31-37
• Special Cases (forward & backward substitution): Slide 35
• LU Decomposition: Slides 44-58
60