MACI - University of Alberta - April 2001
1
High-Performance Computing
José Nelson Amaral
Department of Computing Science
University of Alberta
[email protected]
MACI - University of Alberta - April 2001
2
Why High Performance Computing?
Many important problems cannot be solved yet even with the fastest machines available.
faster computers enable the formulation of more interesting questions
when a problem is solved, researchers find bigger problems to tackle!
MACI - University of Alberta - April 2001
3
Grand Challenges
weather forecasting
economic modeling
computer-aided design
drug design
exploring the origins of the universe
searching for extra-terrestrial life
computer vision
MACI - University of Alberta - April 2001
4
Grand Challenges
To simulate the folding of a 300-amino-acid protein in water:
# of atoms: ~32,000
folding time: 1 millisecond
# of FLOPs: 3 × 10^22
Machine speed: 1 PetaFLOP/s
Simulation time: 1 year
(Source: IBM Blue Gene Project)
IBM's answer: the Blue Gene Project, with US$100 M of funding to build a 1 PetaFLOP/s computer
Ken Dill and Kit Lau's protein folding model.
Charles L Brooks III, Scripps Research Institute
MACI - University of Alberta - April 2001
5
Grand Challenges
In 1996 the GeneCrunch project demonstrated that a cluster of SGI Challenges (64 processors) delivers near-linear speedup for multiple sequence alignment.
MACI - University of Alberta - April 2001
6
Commercial Applications
In October 2000, SGI and ESI (France) revealed a crash simulator to be used in the future BMW Series 5.
Sustained performance: 12 GFLOPS
Processors: 96 × 400 MHz MIPS
Machine: SGI Origin 3000 series
MACI - University of Alberta - April 2001
7
Powerful Computers
Increased computing power enables increasing problem dimensions:
adding more particles to a system
increasing the accuracy of the result
improving experiment turnaround time
MACI - University of Alberta - April 2001
8
Speed and Storage
MACI - University of Alberta - April 2001
9
Solution?
Instead of using a single processor …
use multiple processors
combine their efforts to solve a problem
benefit from the aggregate of their processing speed, memory, cache and disk storage
MACI - University of Alberta - April 2001
10
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
11
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
[Timeline figure: 1980, 1988, 1990, 1994, 1998, 2000]
MACI - University of Alberta - April 2001
13
Distributed Memory Machine Architecture
[Diagram: processors, each with caches, local memory and I/O, connected by an interconnection network]
Non-Uniform Memory Access (NUMA): accessing local memory is faster than accessing remote memory
MACI - University of Alberta - April 2001
14
Centralized Shared Memory Multiprocessor
[Diagram: processors with caches sharing a single main memory and I/O system through an interconnection network]
MACI - University of Alberta - April 2001
15
Centralized Shared Memory Multiprocessor
[Diagram: processors with caches connected through an interconnection network to multiple memory modules and I/O controllers]
Uniform Memory Access (UMA): the "Dance Hall" approach
MACI - University of Alberta - April 2001
16
Distributed Shared Memory(Clusters of SMPs)
[Diagram: SMP nodes, each with several processors and caches plus local memory and I/O connected by a node interconnection network; the nodes are joined by a cluster interconnection network]
Typically: Shared Address Space with Non-Uniform Memory Access (NUMA)
MACI - University of Alberta - April 2001
17
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
18
What’s Next?
What is Next in High-Performance Computing? (Gordon Bell and Jim Gray, Comm. of the ACM, Feb. 2002)
Thesis:
1. Clusters are becoming ubiquitous, and even traditional data centers are migrating to clusters;
2. Grid communities are beginning to provide significant advantages for addressing parallel problems and sharing vast numbers of files.
“Dark Side of Clusters: Clusters perform poorly on applications that require large shared memory.”
MACI - University of Alberta - April 2001
19
Beowulf
Project started at NASA in 1993 with the goal of:
“Implementing a 1 GFLOP/s workstation costing less than US$50,000 using commercial off-the-shelf (COTS) hardware and software.”
In 1994 a US$ 40,000 cluster, with 16 Intel 486s reached the goal.
In 1997 a Beowulf cluster won the Gordon Bell performance/price prize.
In June 2001, 28 Beowulfs were in the Top500 list of the fastest computers in the world.
MACI - University of Alberta - April 2001
20
“The Dark Side of Clusters”
What is Next in High-Performance Computing? (Gordon Bell and Jim Gray, Comm. of the ACM, Feb. 2002)
“Clusters perform poorly on applications that require large shared memory.”
PAP = Peak Advertised Performance
RAP = Real Application Performance
Shared-memory computers deliver a RAP of 30-50% of the PAP, while clusters deliver 5-15% of the PAP.
MACI - University of Alberta - April 2001
21
Non-Shared Address Space
Clusters require an explicit message passing programming model:
MPI is the most widely used parallel programming model today
PVM is used in some engineering departments.
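To make the message-passing model concrete, here is a minimal MPI sketch in C: two processes, one MPI_Send and one matching MPI_Recv. The payload value and the tag are arbitrary choices for this illustration, not part of the original slides.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                       /* arbitrary payload */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process 1 received %d from process 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes, e.g. "mpirun -np 2 ./a.out"; every byte that moves between the two address spaces is moved by an explicit send/receive pair.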
MACI - University of Alberta - April 2001
22
Large and Expensive Clusters
ASCI White
8,192 PowerPC processors
6 TB of memory
160 TB of disk space
12.3 Teraops (peak)
28 tractor trailers to transport (July 2000)
Supplier: IBM
Client: US Department of Energy
Main application: simulated testing of the nuclear weapons stockpile
MACI - University of Alberta - April 2001
23
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
24
Programming Model Requirements
What data can be named by the threads?
What operations can be performed on the named data?
What ordering exists among these operations?
MACI - University of Alberta - April 2001
25
Programming Model Requirements
Naming:
Global Physical Address Space
Independent Local Physical Address Spaces
Ordering:
Mutual Exclusion
Events
Communication vs. Synchronization
MACI - University of Alberta - April 2001
26
Parallel Framework
Layers: Programming Model:
Multiprogramming: lots of jobs, no communication
Shared address space: communicate via memory
Message passing: send and receive messages
MACI - University of Alberta - April 2001
27
Message Passing Model
Communicate through explicit I/O operations
Essentially NUMA, but integrated at I/O devices rather than at the memory system
Send specifies a local buffer + the receiving process on the remote computer
Receive specifies the sending process on the remote computer + a local buffer to place the data
Synchronization: when the send completes, when the buffer is free, when the request is accepted, the receive waits for the send
Send + receive => memory-to-memory copy, where each side supplies a local address, AND does pair-wise synchronization!
MACI - University of Alberta - April 2001
28
Shared Address Model Summary
Each processor can name every physical location in the machine
Each process can name all data it shares with other processes
Data transfer via load and store
Data size: byte, word, ... or cache blocks
Uses virtual memory to map virtual addresses to local or remote physical addresses
Memory hierarchy model applies: communication moves data to the local processor cache (as a load moves data from memory to cache)
MACI - University of Alberta - April 2001
29
Shared Address/Memory Multiprocessor Model
Communicate via Load and Store
Oldest and most popular model
process: a virtual address space and ~1 thread of control
Multiple processes can overlap (share), but all threads share the process address space
Writes to the shared address space by one thread are visible to reads by other threads
MACI - University of Alberta - April 2001
30
Advantages of the shared-memory communication model
Compatibility with SMP hardware
Ease of programming
• for complex communication patterns; or
• for dynamic communication patterns
Uses the familiar SMP model
• attention only on performance-critical accesses
Lower communication overhead
• better use of bandwidth for small items
• memory mapping implements protection in hardware
HW-controlled caching
• reduces remote communication by caching all data, both shared and private
MACI - University of Alberta - April 2001
31
Advantages of the message-passing communication model
The hardware can be simpler
Communication is explicit =>
• simpler to understand
• focuses attention on the costly aspect of parallel computation
Synchronization is associated with messages
• reduces the potential for errors introduced by incorrect synchronization
Easier to implement sender-initiated communication models, which may have some advantages in performance
MACI - University of Alberta - April 2001
32
Programming Models
[Diagram relating programming models to architectures, spanning the Data Parallel and Task Parallel models:
SIMD - Single Instruction, Multiple Data → SIMD architectures
SPMD - Single Program, Multiple Data → MIMD architectures
MPMD - Multiple Programs, Multiple Data → MIMD architectures]
MACI - University of Alberta - April 2001
33
OpenMP (1)
OpenMP gives programmers a “simple” and portable interface for developing shared-memory parallel programs
OpenMP supports C/C++ and Fortran on “all” architectures, including Unix platforms and Windows NT platforms
may become the industry standard
MACI - University of Alberta - April 2001
34
OpenMP (2) - C
#pragma omp parallel for shared(A) private(i)
for( i = 1; i <= 100; i++ ) {
    ... + A[i];          /* use A[i]     */
    A[i] = ... ;         /* compute A[i] */
}
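For context, a complete, compilable variant of this kind of loop might look as follows; the array contents and the reduction are made-up details for this sketch, not the slide's original computation.

    #include <stdio.h>

    #define N 100

    int main(void)
    {
        double A[N + 1];
        double sum = 0.0;
        int i;

        /* Iterations are independent, so OpenMP can split them across
           threads; the reduction combines the per-thread partial sums. */
        #pragma omp parallel for shared(A) private(i) reduction(+:sum)
        for (i = 1; i <= N; i++) {
            A[i] = 2.0 * i;      /* compute A[i] */
            sum += A[i];         /* use A[i]     */
        }

        printf("sum = %f\n", sum);
        return 0;
    }

Built with an OpenMP-aware compiler (e.g. "gcc -fopenmp"), the same source also compiles and runs sequentially when the pragma is ignored, which is one of OpenMP's main attractions.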
MACI - University of Alberta - April 2001
35
OpenMP (3) - Fortran
c$omp parallel do schedule(static)
c$omp& shared(omega,error,uold,u)
c$omp& private(i,j,resid)
c$omp& reduction(+:error)
      do j = 2, m-1
        do i = 2, n-1
          resid = calcerror(uold,i,j)
          u(i,j) = uold(i,j) - omega * resid
          error = error + resid*resid
        end do
      end do
c$omp end parallel do
MACI - University of Alberta - April 2001
36
Vector Processing (1)
Cray, NEC computers
multiple functional units, each with multiple stages
replication and pipelined parallelism at the instruction level
MACI - University of Alberta - April 2001
37
Vector Processing (2)
for( i=0; i<=N; i++ )
    A[i] = B[i] * C[i];
[Diagram: successive multiplications flow through pipelined functional units Mult1, Mult2, Mult3]
MACI - University of Alberta - April 2001
38
Multi-threading
OS-level multi-threading: P-threads
Programming Language-level multi-threading: Java
Fine Grain Multi-threading: Threaded-C, Cilk, TAM
Hardware Supported Multi-threading: Tera
Instruction Level Multi-threading: Simultaneous Multi-threading (Compaq-Intel)
MACI - University of Alberta - April 2001
39
Other Issues: Debugging
Debugging parallel programs can be frustrating
non-deterministic execution
probe effect
difficult to "stop" a parallel program
multiple core files
difficult to visualize parallel activity
tools are barely adequate
MACI - University of Alberta - April 2001
40
Other Issues:Performance Tuning
Use available performance tuning tools (perfex, Speedshop on SGI) to know where the program spends time.
Re-tune code for performance when hardware changes.
MACI - University of Alberta - April 2001
41
Other Issues: Fault Tolerance
Consider a job running on 40 processors for a week; then there is a power outage, and all the work is lost.
Long-running jobs must be able to save the program's state and restart from that state. This is called checkpointing.
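A minimal sketch of checkpointing in C, assuming the entire program state fits in a single struct and using a made-up file name; production checkpointing (and parallel checkpointing in particular) has to deal with much more, such as open files, in-flight messages, and atomic replacement of the previous checkpoint.

    #include <stdio.h>

    /* Hypothetical program state: one struct that fully describes progress. */
    struct state {
        long   iteration;
        double result;
    };

    /* Write the state to disk so the job can be restarted after a crash. */
    int checkpoint(const char *path, const struct state *s)
    {
        FILE *f = fopen(path, "wb");
        if (!f) return -1;
        size_t n = fwrite(s, sizeof *s, 1, f);
        fclose(f);
        return n == 1 ? 0 : -1;
    }

    /* Try to resume from a previous checkpoint; return 0 on success. */
    int restore(const char *path, struct state *s)
    {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;
        size_t n = fread(s, sizeof *s, 1, f);
        fclose(f);
        return n == 1 ? 0 : -1;
    }

    int main(void)
    {
        struct state s = { 0, 0.0 };

        if (restore("job.ckpt", &s) == 0)
            printf("resuming at iteration %ld\n", s.iteration);

        for (; s.iteration < 1000000; s.iteration++) {
            s.result += 1.0;                      /* the real work goes here */
            if (s.iteration % 100000 == 0)
                checkpoint("job.ckpt", &s);       /* save state periodically */
        }
        printf("result = %f\n", s.result);
        return 0;
    }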
MACI - University of Alberta - April 2001
42
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
43
What Does Coherency Mean?
Informally:
• "Any read must return the most recent write"
• Too strict and too difficult to implement
Better:
• "Any write must eventually be seen by a read"
• All writes are seen in proper order ("serialization")
Two rules to ensure this:
• "If P writes x and P1 reads x, P's write will be seen by P1 if the read and write are sufficiently far apart"
• Writes to a single location are serialized: seen in the same order
MACI - University of Alberta - April 2001
44
Potential HW Coherency Solutions
Snooping Solution (Snoopy Bus):
• Send all requests for data to all processors
• Each processor snoops to see if it has a copy
• Requires broadcast
• Works well with a bus (natural broadcast medium)
• Preferred scheme for small-scale machines
Directory-Based Schemes:
• Keep track of what is being shared in one centralized place
• Distributed memory => distributed directory
• Sends point-to-point requests
• Scales better than snooping
MACI - University of Alberta - April 2001
45
Basic Snoopy Protocols
Write Invalidate Protocol:
• Multiple readers, single writer
• Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
• Read miss:
  • Write-through: memory is always up-to-date
  • Write-back: snoop in caches to find the most recent copy
Write Broadcast Protocol:
• Write to shared data: broadcast on the bus; processors snoop and update any copies
• Read miss: memory is always up-to-date
Write serialization: the bus serializes requests!
• The bus is a single point of arbitration
MACI - University of Alberta - April 2001
46
Basic Snoopy Protocols
Write Invalidate versus Broadcast:
• Invalidate requires one transaction per write-run
• Invalidate uses spatial locality: one transaction per block
• Broadcast has lower latency between write and read
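As an illustration of the write-invalidate idea (not a real coherence controller), the toy C sketch below tracks the state of a single block in two caches with MSI-like states and prints the transitions for the read/read/write/read sequence illustrated on the following slides; the names and the event sequence are invented for this example.

    #include <stdio.h>

    /* Toy write-invalidate protocol for ONE memory block and two caches. */
    enum state { INVALID, SHARED, MODIFIED };
    static const char *name[] = { "Invalid", "Shared", "Modified" };

    static enum state cache[2] = { INVALID, INVALID };

    static void read_block(int p)
    {
        if (cache[p] == INVALID) {
            /* Read miss: fetch the block; a Modified copy elsewhere is
               written back and both copies become Shared.              */
            if (cache[1 - p] == MODIFIED)
                cache[1 - p] = SHARED;
            cache[p] = SHARED;
        }
    }

    static void write_block(int p)
    {
        /* Write: invalidate the other copy; own copy becomes Modified. */
        cache[1 - p] = INVALID;
        cache[p] = MODIFIED;
    }

    static void show(const char *event)
    {
        printf("%-10s  P0:%-8s  P1:%-8s\n",
               event, name[cache[0]], name[cache[1]]);
    }

    int main(void)
    {
        read_block(0);  show("P0 reads");
        read_block(1);  show("P1 reads");
        write_block(0); show("P0 writes");   /* P1's copy is invalidated   */
        read_block(1);  show("P1 reads");    /* P0 downgrades, both Shared */
        return 0;
    }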
MACI - University of Alberta - April 2001
47
Snoopy Cache Invalidation Protocol (Example)
[Diagram: main memory holds x (value o); Processor 0 issues "read x", causing a read miss on the interconnection network]
MACI - University of Alberta - April 2001
48
Snoopy Cache Invalidation Protocol (Example)
[Diagram: the block containing x (value o) is now in the reading processor's cache in the shared state]
MACI - University of Alberta - April 2001
49
Snoopy Cache Invalidation Protocol (Example)
[Diagram: a second processor issues "read x", causing another read miss]
MACI - University of Alberta - April 2001
50
Snoopy Cache Invalidation Protocol (Example)
[Diagram: both processors now hold x (value o) in the shared state]
MACI - University of Alberta - April 2001
51
Snoopy Cache Invalidation Protocol (Example)
[Diagram: one of the processors issues "write x"; an invalidate is broadcast on the interconnection network]
MACI - University of Alberta - April 2001
52
Snoopy Cache Invalidation Protocol (Example)
[Diagram: the writing processor now holds x (value 1) in the exclusive state; the other copies have been invalidated and memory still holds the old value o]
MACI - University of Alberta - April 2001
53
Programmer's Abstraction for a Sequential Consistency Model
[Diagram: processors P1 … Pn connected through a single switch to memory; the switch is randomly set after each memory reference]
(See Culler, Singh, and Gupta, p. 287)
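To make the abstraction concrete, the classic litmus test below shows what sequential consistency forbids: if every access went through the single switch above, at least one of the two observed values must be 1. This is a sketch using C11 threads and relaxed atomics (availability of <threads.h> varies by platform); on hardware and compilers with weaker orderings the "forbidden" outcome can actually be observed.

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    atomic_int x, y;          /* shared flags, reset to 0 each round   */
    int r1, r2;               /* values observed by the two threads    */

    int t1(void *arg) {       /* thread 1: write x, then read y */
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return 0;
    }

    int t2(void *arg) {       /* thread 2: write y, then read x */
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return 0;
    }

    int main(void)
    {
        int violations = 0;
        for (int i = 0; i < 100000; i++) {
            atomic_store(&x, 0);
            atomic_store(&y, 0);
            thrd_t a, b;
            thrd_create(&a, t1, NULL);
            thrd_create(&b, t2, NULL);
            thrd_join(a, NULL);
            thrd_join(b, NULL);
            /* Under sequential consistency r1 == 0 && r2 == 0 cannot
               happen; with relaxed ordering on a real machine it can. */
            if (r1 == 0 && r2 == 0)
                violations++;
        }
        printf("non-SC outcomes observed: %d\n", violations);
        return 0;
    }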
MACI - University of Alberta - April 2001
54
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
55
Top 500 (November 10, 2001)
Manuf.  | Computer                      | Rmax   | Site                      | Year | Proc. | Rpeak | Rmax/Rpeak
IBM     | ASCI White, Power3, 375 MHz   | 7226   | Lawrence Livermore        | 2000 | 8192  | 12288 | 0.59
Compaq  | AlphaServer SC ES45/1 GHz     | 4059   | Pittsburgh Superc. Center | 2001 | 6048  | 6048  | 0.67
IBM     | SP Power3, 375 MHz, 16-way    | 3052   | NERSC/LBNL                | 2001 | 3328  | 4992  | 0.61
Intel   | ASCI Red                      | 2379   | Sandia Nat. Lab.          | 1999 | 9632  | 3207  | 0.74
IBM     | ASCI Blue-Pacific, SP 604e    | 2144   | Lawrence Livermore        | 1999 | 5808  | 3868  | 0.55
Compaq  | AlphaServer SC ES45/1 GHz     | 2096   | Los Alamos Nat. Lab.      | 2001 | 1536  | 3072  | 0.68
Hitachi | SR8000/MPP                    | 1709.1 | Univ. of Tokyo            | 2001 | 1152  | 2074  | 0.82
…
NEC (12)| SX-5/128M8 3.2ns              | 1192   | Osaka University          | 2001 | 128   | 1280  | 0.93

Rmax = maximal LINPACK performance achieved [GFLOPS]
Rpeak = theoretical peak performance [GFLOPS]
Canada appears in ranks 123, 144, 183, 255, 266, 280, 308, 311, 315, 414, 419.
MACI - University of Alberta - April 2001
56
Top500 Statistics
[Charts: breakdown of the Top 500 by computer family (HP SPP, IBM SP, SGI Origin, T3E/T3D, NOW, Sun UltraHPC, others), by type of organization (industry, research, academic, classified, vendor, government), by machine organization (MPP, constellations, SMP, clusters), and by country (USA, Germany, Japan, UK, France, Korea, Italy, Canada, others)]
MACI - University of Alberta - April 2001
57
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
58
Intel Architecture 64
Itanium, the first one, is out….
… but we are still waiting for McKinley...
Will we get Yamhill (Intel’s Plan B) instead?
MACI - University of Alberta - April 2001
59
Alpha is gone...
So is Compaq! EV8 is scrapped.
Compaq designers split between AMD and Intel.
Intel converts to the Simultaneous Multithreading religion, but renames it: Hyperthreading!!
MACI - University of Alberta - April 2001
60
IBM’s POWER4 is the 2001/2002 winner
Best floating-point and integer performance available on the market.
Highest memory bandwidth in the industry.
Well integrated cache coherence mechanism.
MACI - University of Alberta - April 2001
61
Implicit vs. Explicit Instruction-Level Parallelism
EPIC: Explicitly Parallel Instruction Computing
Superscalar: Instruction-Level Parallelism discovered implicitly by the compiler or by the hardware.
MACI - University of Alberta - April 2001
62
Instruction Level Parallelism
Most ILP is implicit, i.e., instructions that can be executed in parallel are automatically discovered by the hardware at runtime.
Intel launched Itanium, the first IA-64 processor, which explores explicit parallelism at the instruction level; in this processor the compiler encodes parallel operations in the assembly code.
But applications are still written in standard sequential programming languages.
MACI - University of Alberta - April 2001
63
Instruction Level Parallelism
For example, in the IA-64 an instruction group, identified by the compiler, is a set of instructions that have no read-after-write (RAW) or write-after-write (WAW) register dependencies (they can execute in parallel).
Consecutive instruction groups are separated by stops (represented by a double semicolon in the assembly code).

ld8  r1=[r5] ;;   // First group
add  r3=r1,r4     // Second group
sub  r6=r8,r9 ;;  // Second group
st8  [r6]=r12     // Third group
MACI - University of Alberta - April 2001
64
IA-64 Innovations
if-conversion: execute both sides of a branch

if (r1 == 0)
    r2 = r3 + r4
else
    r7 = r6 - r5

     cmp.eq p1,p2 = r1,0 ;;  // set predicate registers
(p1) add r2 = r3,r4
(p2) sub r7 = r6,r5

data speculation: load a value before knowing if the address is correct.
control speculation: execute a computation before knowing if it has to be executed.
rotating registers: support for software pipelining.
MACI - University of Alberta - April 2001
65
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
66
Below / Above the Line
for(n=0 ; …)
  for(f=0 ; …)
    for(t=0 ; …)
      for(x=0 ; …)
        for(y=0 ; …)
          for(z=0 ; …) { …….. }
Application-Level Parallelism
Automatic Parallelism
MACI - University of Alberta - April 2001
67
Some Common Loop Optimizations
Unswitching
Loop Peeling
Loop Alignment
Index Set Splitting
Scalar Expansion
Loop Fusion
Loop Fission
Loop Reversal
Loop Interchange
MACI - University of Alberta - April 2001
68
Unswitching
Remove loop independent conditionals from a loop.
Before Unswitching:
for i=1 to N do
  for j=2 to N do
    if T[i] > 0 then
      A[i,j] = A[i,j-1]*T[i] + B[i]
    else
      A[i,j] = 0.0
    endif
  endfor
endfor

After Unswitching:
for i=1 to N do
  if T[i] > 0 then
    for j=2 to N do
      A[i,j] = A[i,j-1]*T[i] + B[i]
    endfor
  else
    for j=2 to N do
      A[i,j] = 0.0
    endfor
  endif
endfor
MACI - University of Alberta - April 2001
69
Loop Peeling
Remove the first (last) iteration of the loop into separate code.
Before Peeling:
for i=1 to N do
  A[i] = (X+Y)*B[i]
endfor

After Peeling:
if N >= 1 then
  A[1] = (X+Y)*B[1]
  for i=2 to N do
    A[i] = (X+Y)*B[i]
  endfor
endif
MACI - University of Alberta - April 2001
70
Index Set Splitting
Divides the index set into two portions.
Before Set Splitting:
for i=1 to 100 do
  A[i] = B[i] + C[i]
  if i > 10 then
    D[i] = A[i] + A[i-10]
  endif
endfor

After Set Splitting:
for i=1 to 10 do
  A[i] = B[i] + C[i]
endfor
for i=11 to 100 do
  A[i] = B[i] + C[i]
  D[i] = A[i] + A[i-10]
endfor
MACI - University of Alberta - April 2001
71
Scalar Expansion
Breaks anti-dependence relations by expanding, or promoting a scalar into an array.
Before Scalar Expansion:
for i=1 to N do
  T = A[i] + B[i]
  C[i] = T + 1/T
endfor

After Scalar Expansion:
if N >= 1 then
  allocate Tx(1:N)
  for i=1 to N do
    Tx[i] = A[i] + B[i]
    C[i] = Tx[i] + 1/Tx[i]
  endfor
  T = Tx[N]
endif
MACI - University of Alberta - April 2001
72
Loop Fusion
Takes two adjacent loops and generates a singleloop.
Before Loop Fusion:
(1) for i=1 to N do
(2)   A[i] = B[i] + 1
(3) endfor
(4) for i=1 to N do
(5)   C[i] = A[i] / 2
(6) endfor
(7) for i=1 to N do
(8)   D[i] = 1 / C[i+1]
(9) endfor

After Loop Fusion:
(1) for i=1 to N do
(2)   A[i] = B[i] + 1
(5)   C[i] = A[i] / 2
(6) endfor
(7) for i=1 to N do
(8)   D[i] = 1 / C[i+1]
(9) endfor
MACI - University of Alberta - April 2001
73
Loop Fusion (Another Example)
(1) for i=1 to 99 do
(2)   A[i] = B[i] + 1
(3) endfor
(4) for i=1 to 98 do
(5)   C[i] = A[i+1] * 2
(6) endfor

(2) A[1] = B[1] + 1
(1) for i=2 to 99 do
(2)   A[i] = B[i] + 1
(3) endfor
(4) for i=1 to 98 do
(5)   C[i] = A[i+1] * 2
(6) endfor

(1) i = 1
(2) A[i] = B[i] + 1
    for ib=0 to 97 do
(1)   i = ib+2
(2)   A[i] = B[i] + 1
(4)   i = ib+1
(5)   C[i] = A[i+1] * 2
(6) endfor
MACI - University of Alberta - April 2001
74
Loop Fission
Breaks a loop into two or more smaller loops.
Original Loop:
(1) for i=1 to N do
(2)   A[i] = A[i] + B[i-1]
(3)   B[i] = C[i-1]*X + Z
(4)   C[i] = 1/B[i]
(5)   D[i] = sqrt(C[i])
(6) endfor

After Loop Fission:
(1) for ib=0 to N-1 do
(3)   B[ib+1] = C[ib]*X + Z
(4)   C[ib+1] = 1/B[ib+1]
(6) endfor
(1) for ib=0 to N-1 do
(2)   A[ib+1] = A[ib+1] + B[ib]
(6) endfor
(1) for ib=0 to N-1 do
(5)   D[ib+1] = sqrt(C[ib+1])
(6) endfor
(1) i = N+1
MACI - University of Alberta - April 2001
75
Loop Reversal
Run a loop backward. All dependence directions are reversed.
It is only legal for loops that have no loop-carried dependences.
Can be used to allow fusion:

(1) for i=1 to N do
(2)   A[i] = B[i] + 1
(3)   C[i] = A[i]/2
(4) endfor
(5) for i=1 to N do
(6)   D[i] = 1/C[i+1]
(7) endfor

(1) for i=N downto 1 do
(2)   A[i] = B[i] + 1
(3)   C[i] = A[i]/2
(4) endfor
(5) for i=N downto 1 do
(6)   D[i] = 1/C[i+1]
(7) endfor

(1) for i=N downto 1 do
(2)   A[i] = B[i] + 1
(3)   C[i] = A[i]/2
(6)   D[i] = 1/C[i+1]
(7) endfor
MACI - University of Alberta - April 2001
76
Loop Interchanging
(1) for j=2 to M do
(2)   for i=1 to N do
(3)     A[i,j] = A[i,j-1] + B[i,j]
(4)   endfor
(5) endfor

(1) for i=1 to N do
(2)   for j=2 to M do
(3)     A[i,j] = A[i,j-1] + B[i,j]
(4)   endfor
(5) endfor
MACI - University of Alberta - April 2001
77
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
78
Speedup
goal is to use N processors to make a program run N times faster
speedup is the factor by which the program’s speed improves
Speedup(p processors) = Performance(p processors) / Performance(1 processor)

Speedup(p processors) = Time(1 processor) / Time(p processors)
MACI - University of Alberta - April 2001
79
Absolute vs. Relative Speedup
Careful: the execution time depends on what the program does!
A parallel program spends time in:
Work
Synchronization
Communication
Extra work
A program implemented for a parallel machine is likely to do extra work (compared with a sequential program) even when running on a single-processor machine!
MACI - University of Alberta - April 2001
80
Absolute vs. Relative Speedup
When talking about execution time, ask what algorithm is implemented!
Relative Speedup(p processors) = Time(Parallel Alg., 1 processor) / Time(Parallel Alg., p processors)

Absolute Speedup(p processors) = Time(Sequential Alg., 1 processor) / Time(Parallel Alg., p processors)
MACI - University of Alberta - April 2001
81
Speedup
MACI - University of Alberta - April 2001
82
Which is Better?
programs A & B solve the same problem using different algorithms
both are run on a 100-processor computer
program A gets a 90-fold speedup
program B gets a 10-fold speedup
Which one would you prefer to use?
MACI - University of Alberta - April 2001
83
It Depends!
all that matters is overall execution time
what if A runs sequentially 1,000 times slower than B?
always use the best sequential time (over all algorithms) for computing speedups!
and the best compiler!
MACI - University of Alberta - April 2001
84
Superlinear Speedups
sometimes N processors can achieve a speedup > N
usually the result of improving an inferior sequential algorithm
can legitimately occur because of cache and memory effects
MACI - University of Alberta - April 2001
85
Amdahl’s Law (1)
seq: sequential portion of a program
par: parallel portion of a program
N: number of processors

Speedup = (seq + par) / (seq + par / N)

The maximum speedup that can be obtained is:
MaxSpeedup = (seq + par) / seq = 1.0 / seq     (since seq + par = 1.0)
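A small worked example of these formulas in C (the 5% sequential fraction is an arbitrary assumption, not from the slides):

    #include <stdio.h>

    /* Amdahl's law: speedup achievable with n processors when a fraction
       'seq' of the program is inherently sequential (seq + par = 1).     */
    static double amdahl(double seq, double n)
    {
        return 1.0 / (seq + (1.0 - seq) / n);
    }

    int main(void)
    {
        double seq = 0.05;                    /* assume 5% sequential */
        for (int n = 1; n <= 1024; n *= 4)
            printf("N = %4d  speedup = %6.2f\n", n, amdahl(seq, n));
        printf("limit as N -> infinity: %.1f\n", 1.0 / seq);   /* = 20 */
        return 0;
    }

With seq = 0.05 the speedup is bounded by 20 no matter how many processors are added.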
MACI - University of Alberta - April 2001
86
Amdahl’s Law (2)
[Graph: speedup vs. number of processors N, with one curve per value of the sequential fraction (% seq time)]
MACI - University of Alberta - April 2001
87
Scalability
desirable property of algorithm is scalability, regardless of speedup
problem of size P using N processors takes time T
problem is scalable if problem of size 2P on 2N processors still takes time T
MACI - University of Alberta - April 2001
88
Choose Right Algorithm
understand strengths and weaknesses of hardware being used
choose an algorithm to exploit strengths and avoid weaknesses
Example: there are many parallel sorting algorithms, each valid for different hardware/application properties
MACI - University of Alberta - April 2001
89
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
90
The Reality Is ...
software – more issues to be addressed beyond sequential case
hardware – need to understand machine architecture
Parallel programming is hard!
The Reward Is ...
high performance – the motivation in the first place!
MACI - University of Alberta - April 2001
91
Do You Need Parallelism?
Consider the trade-offs:
+ Potentially faster execution turnaround
- Longer software development time
- Obtaining machine access
- Cost = f(development) + g(execution time)
Do the benefits outweigh the costs?
MACI - University of Alberta - April 2001
92
Resistance to Parallelism
software inertia – cost of converting existing software
hardware inertia – waiting for faster machines
lack of education
limited access to resources
MACI - University of Alberta - April 2001
93
Starting Out...
parallel program design is black magic
software tools are primitive
experience is an asset
All is not lost!
many problems are amenable to simple parallel solutions
MACI - University of Alberta - April 2001
94
Starting Out...
sequential world is simple: one architectural model
in parallel world, need to choose the algorithm to suit the architecture
parallel algorithm may only perform well on one class of machine
MACI - University of Alberta - April 2001
95
Granularity (1)
Granularity is the relation between the amount of computation and the amount of communication.
It is a measure of how much work gets done before processes have to communicate.
MACI - University of Alberta - April 2001
96
Granularity (2)
Problem: shopping for 100 items of grocery.
Scenario 1: (small granularity)
You are told an item to buy, go to the store, purchase the item, then return home to find out the next item to buy.
Scenario 2: (large granularity)
You are told all 100 items, and then you make a single trip to purchase everything.
MACI - University of Alberta - April 2001
97
Granularity (3)
As in the real world, communication and synchronization take time
MACI - University of Alberta - April 2001
98
Architectures
Match problem granularity to the parallel architecture:
fine-grained: vector/array processors
medium-grained: shared-memory multiprocessors
coarse/large-grained: networks of workstations
MACI - University of Alberta - April 2001
99
Program Design
1) Identify the hardware platforms available
2) Identify parallelism in the application
3) Choose the right type of algorithmic parallelism to match the architecture
4) Implement the algorithm, being wary of performance and correctness issues
MACI - University of Alberta - April 2001
100
Vector Processing (3)
this class of machines is very effective at striding through arrays
some parallelism can be automatically detected by compiler
there is a "right" way and a "wrong" way to code loops to allow parallelism
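A hedged sketch of what "right" and "wrong" can mean here: the first loop below has independent, unit-stride iterations that a vectorizing compiler can handle, while the second carries a dependence from one iteration to the next and cannot be vectorized as written. The function names and the array size are made up for this illustration.

    #define N 1024

    void scale_ok(double *a, const double *b, const double *c)
    {
        /* Independent iterations, unit stride: easy to vectorize. */
        for (int i = 0; i < N; i++)
            a[i] = b[i] * c[i];
    }

    void recurrence_bad(double *a)
    {
        /* a[i] depends on a[i-1] computed in the previous iteration:
           a loop-carried dependence that blocks straightforward
           vectorization.                                            */
        for (int i = 1; i < N; i++)
            a[i] = a[i] - a[i - 1];
    }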
MACI - University of Alberta - April 2001
101
Distributed Memory (1)
Loosely coupled processors
IBM SP-2, networks of computers
PCs connected by a fast network
MACI - University of Alberta - April 2001
102
Distributed Memory (2)
communication between processes by sending messages
Overhead of parallelism includes:
cost of preparing a message
cost of sending a message
cost of waiting for a message
MACI - University of Alberta - April 2001
103
Communication
distributed memory (loosely coupled)
Explicitly send messages between processes
MACI - University of Alberta - April 2001
104
Synchronization
MACI - University of Alberta - April 2001
105
Message Passing (1)
Process 1                  Process 2
compute                    compute
Send( P2, info );          compute
compute                    Receive( P1, info );
idle                       compute
idle                       Send( P1, reply );
Receive( P2, reply );
The Send/Receive pairs both communicate and synchronize.
MACI - University of Alberta - April 2001
106
Message Passing (2)
Two popular message passing libraries:
PVM (Parallel Virtual Machine)
MPI (Message Passing Interface)
MPI will likely be the industry standard
Both are easy to use, but verbose
MACI - University of Alberta - April 2001
107
Master/Slave (1)
many distributed memory programs are structured as one master and N slaves
the master generates work to do
when idle, a slave asks for a piece of work, gets it, does the task, and reports the result
MACI - University of Alberta - April 2001
108
Master/Slave (2)
[Diagram: one master process M connected to several slave processes S]

Master                              Slave
worklist = Make( work );            data = NEED_WORK;
while( worklist != empty )          while( true )
{                                   {
  Receive( slave, result );           Send( master, data );
  to_do = Head( worklist );           /* wait */
  Send( slave, to_do );               Receive( master, work );
  Process( result );                  data = Process( work );
}                                   }
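The same structure expressed as a hedged MPI sketch in C; the tags, the number of work items, the squaring "work", and the termination convention are all invented for this illustration (it also assumes at least one slave and no more slaves than work items).

    #include <mpi.h>
    #include <stdio.h>

    #define NITEMS   100   /* total pieces of work (arbitrary) */
    #define WORK_TAG 1
    #define STOP_TAG 2

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* assumes 2 <= size <= NITEMS+1 */

        if (rank == 0) {                       /* master */
            int next = 0, result;
            MPI_Status st;

            /* hand one initial work item to every slave */
            for (int w = 1; w < size; w++) {
                MPI_Send(&next, 1, MPI_INT, w, WORK_TAG, MPI_COMM_WORLD);
                next++;
            }
            /* collect one result per work item; reply with more work or STOP */
            for (int done = 0; done < NITEMS; done++) {
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                int tag = (next < NITEMS) ? WORK_TAG : STOP_TAG;
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, tag, MPI_COMM_WORLD);
                if (tag == WORK_TAG) next++;
            }
        } else {                               /* slave */
            int work, result;
            MPI_Status st;

            while (1) {
                MPI_Recv(&work, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == STOP_TAG) break;
                result = work * work;          /* the "Process(work)" step */
                MPI_Send(&result, 1, MPI_INT, 0, WORK_TAG, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }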
MACI - University of Alberta - April 2001
109
Pitfall: Deadlock
scenario by which no further progress can be made, since each process is waiting on a cyclic dependency
MACI - University of Alberta - April 2001
110
Pitfall: Deadlock
A real-world problem!
MACI - University of Alberta - April 2001
111
Pitfall: Load Balancing
need to assign a roughly equal portion of work to each processor
load imbalances can result in poor speedups
MACI - University of Alberta - April 2001
112
Shared Memory (1)
Tightly coupled multiprocessors
SGI Origin 2400, Sun E10000
Classified by memory access times:
Same (SMP - symmetric multiprocessor)
Different (NUMA - non-uniform memory access)
[Diagram: processors P sharing a single memory]
MACI - University of Alberta - April 2001
113
Shared Memory (2)
communicate between processes through reading/writing shared variables (instead of sending and receiving messages)
Overhead of parallelism includes:
contention for shared resources
protecting the integrity of shared resources
MACI - University of Alberta - April 2001
114
Communication
shared memory (tightly coupled)
Read from and write to shared data
MACI - University of Alberta - April 2001
115
Synchronization
May have to prevent simultaneous access!
Process 1                 Process 2
B = BankBalance;          B = BankBalance;
B = B + 100;              B = B + 150;
BankBalance = B;          BankBalance = B;
Print Statement;          Print Statement;
What is the value of BankBalance?
MACI - University of Alberta - April 2001
116
Pitfall: Shared Data Access
Need to restrict access to data!
Avoid race conditions!
Process 1                 Process 2
Lock( access );           Lock( access );
B = BankBalance;          B = BankBalance;
B = B + 100;              B = B + 150;
BankBalance = B;          BankBalance = B;
Unlock( access );         Unlock( access );
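The same idea expressed with POSIX threads in C (a sketch; the amounts and the initial balance are arbitrary): the mutex makes the read-modify-write of the shared balance atomic, so the final value is always 1250.

    #include <pthread.h>
    #include <stdio.h>

    static double bank_balance = 1000.0;                 /* shared data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *deposit(void *arg)
    {
        double amount = *(double *)arg;
        pthread_mutex_lock(&lock);       /* enter critical section */
        double b = bank_balance;         /* read                   */
        b = b + amount;                  /* modify                 */
        bank_balance = b;                /* write                  */
        pthread_mutex_unlock(&lock);     /* leave critical section */
        return NULL;
    }

    int main(void)
    {
        double a1 = 100.0, a2 = 150.0;
        pthread_t t1, t2;

        pthread_create(&t1, NULL, deposit, &a1);
        pthread_create(&t2, NULL, deposit, &a2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* With the lock the result is always 1250.0, regardless of
           which thread runs first.                                  */
        printf("BankBalance = %.2f\n", bank_balance);
        return 0;
    }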
MACI - University of Alberta - April 2001
117
Multi-threading
MACI - University of Alberta - April 2001
118
Simultaneous Multi-threading
http://www.eet.com/story/0EG19991008S0014 by Rick Merritt
MACI - University of Alberta - April 2001
119
Top 500
http://www.top500.org
[Several slides of charts from the Top 500 list]
MACI - University of Alberta - April 2001
126
Conclusions
some problems require extensive computational resources
parallelism allows you to decrease experiment turnaround time
the tools are adequate but still have a long way to go
“simple” parallelism gets some performance but maximum performance requires effort.
Performance commensurate with effort!
MACI - University of Alberta - April 2001
127
Reminders
understand your computational needs
understand the hardware and software resources available to you
match parallelism to the architecture
maximize utilization, don’t waste cycles!
granularity, granularity, granularity
develop, test and debug small data sets before trying large ones
be wary of the many pitfalls
MACI - University of Alberta - April 2001
128
We Want You!
For help with parallel programming, contact
Get parallel! Become part of MACI