High-Performance Computing

José Nelson Amaral
Department of Computing Science
University of Alberta
amaral@cs.ualberta.ca

MACI - University of Alberta - April 2001


Why High Performance Computing?

Many important problems cannot be solved yet even with the fastest machines available.

faster computers enable the formulation of more interesting questions

when a problem is solved, researchers find bigger problems to tackle!


Grand Challenges

weather forecasting
economic modeling
computer-aided design
drug design
exploring the origins of the universe
searching for extraterrestrial life
computer vision


Grand Challenges

To simulate the folding of a 300-amino-acid protein in water:
  # of atoms: ~32,000
  folding time: 1 millisecond
  # of FLOPs: 3 x 10^22
  machine speed: 1 PetaFLOP/s
  simulation time: 1 year
(Source: IBM Blue Gene Project)

IBM's answer: the Blue Gene Project.
US$100M of funding to build a 1 PetaFLOP/s computer.

Ken Dill and Kit Lau's protein folding model.

Charles L. Brooks III, Scripps Research Institute


Grand Challenges

In 1996 the GeneCrunch project demonstrated that a cluster of SGI Challenge servers (64 processors) delivers near-linear speedup for multiple sequence alignment.


Commercial Applications

In October 2000, SGI and ESI (France) revealed a crash simulator to be used in the future BMW Series 5.

Sustained performance: 12 GFLOPS
Processors: 96 MIPS processors at 400 MHz
Machine: SGI Origin 3000 series


Powerful Computers

Increased computing power enables increasing problem dimensions:

adding more particles to a system
increasing the accuracy of the result
improving experiment turnaround time


Speed and Storage


Solution?

Instead of using a single processor …

use multiple processors
combine their efforts to solve a problem
benefit from the aggregate of their processing speed, memory, cache, and disk storage


This Talk
 Motivation
 Parallel Machine Organizations
 Cluster Computing
 Programming Models
 Cache Coherence and Memory Consistency
 The Top 500: Who is the fastest?
 Processor Architecture: What is new?
 The Role of Compilers
 Speedup and Scalability
 Final Remarks



[Figure: timeline with the years 1980, 1988, 1990, 1994, 1998, 2000]


Distributed Memory Machine Architecture

[Diagram: several nodes, each with a Processor, Caches, Memory, and I/O, connected by an Interconnection Network]

Non-Uniform Memory Access (NUMA): accessing local memory is faster than accessing remote memory.


Centralized Shared Memory Multiprocessor

[Diagram: several Processors with Caches connected through an Interconnection Network to a single Main Memory and I/O System]


Centralized Shared Memory Multiprocessor

[Diagram: Processors with Caches on one side of an Interconnection Network; memory modules (Mem.) and I/O controllers on the other]

Uniform Memory Access (UMA): the "Dance Hall" approach.


Distributed Shared Memory (Clusters of SMPs)

[Diagram: two SMP nodes, each with several Processors and Caches on a Node Interconnection Network sharing a Memory and I/O, joined by a Cluster Interconnection Network]

Typically: Shared Address Space with Non-Uniform Memory Access (NUMA)



What’s Next?

What is Next in High-Performance Computing?
(Gordon Bell and Jim Gray, Comm. of the ACM, Feb. 2002)

Thesis:
1. Clusters are becoming ubiquitous, and even traditional data centers are migrating to clusters.
2. Grid communities are beginning to provide significant advantages for addressing parallel problems and sharing vast numbers of files.

"Dark side of clusters: clusters perform poorly on applications that require large shared memory."


Beowulf

Project started at NASA in 1993 with the goal of:

"Implementing a 1 GFLOPS workstation costing less than US$50,000 using commercial off-the-shelf (COTS) hardware and software."

In 1994 a US$40,000 cluster, with 16 Intel 486s, reached the goal.

In 1997 a Beowulf cluster won the Gordon Bell performance/price prize.

In June 2001, 28 Beowulfs were in the Top 500 fastest computers in the world.


“The Dark Side of Clusters”

What is Next in High-Performance Computing?
(Gordon Bell and Jim Gray, Comm. of the ACM, Feb. 2002)

"Clusters perform poorly on applications that require large shared memory."

PAP = Peak Advertised Performance
RAP = Real Application Performance

Shared-memory computers deliver a RAP of 30-50% of the PAP, while clusters deliver 5-15% of the PAP.


Non-Shared Address Space

Clusters require an explicit message passing programming model:

MPI is the most widely used parallel programming model today

PVM used in some engineering departments.


Large and Expensive Clusters

ASCI White (July 2000):
  8,192 PowerPC processors
  6 TB of memory
  160 TB of disk space
  12.3 TeraOPS (peak)
  28 tractor trailers to transport

Supplier: IBM
Client: U.S. Department of Energy
Main application: simulated testing of the nuclear weapons stockpile



Programming Model Requirements

What data can be named by the threads?

What operations can be performed on the named data?

What ordering exists among these operations?


Programming Model Requirements

Naming:
  Global Physical Address Space
  Independent Local Physical Address Spaces

Ordering:
  Mutual Exclusion
  Events
  Communication vs. Synchronization


Parallel Framework

Layers: Programming Model:

Multiprogramming: lots of jobs, no communication

Shared address space: communicate via memory

Message passing: send and receive messages


Message Passing Model

Communicate through explicit I/O operations
Essentially NUMA, but integrated at I/O devices vs. the memory system
Send specifies local buffer + receiving process on remote computer
Receive specifies sending process on remote computer + local buffer to place data
Synch: when send completes, when buffer is free, when request is accepted, receive waits for send
Send + receive => memory-to-memory copy, where each supplies a local address, AND does pair-wise synchronization!
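As a concrete illustration of the send/receive pairing described above, here is a minimal MPI sketch in C (an assumed example, not code from the talk): rank 0 names a local buffer and the receiving process; rank 1 names the sending process and a local buffer for the data. Run with two processes, e.g. "mpirun -np 2 ./a.out".

/* Minimal MPI send/receive sketch (assumed example). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data[4] = {1, 2, 3, 4};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send: local buffer + receiving process (rank 1) + tag. */
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int recv_buf[4];
        /* Receive: sending process (rank 0) + local buffer to place the data. */
        MPI_Recv(recv_buf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d ... %d\n", recv_buf[0], recv_buf[3]);
    }
    MPI_Finalize();
    return 0;
}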


Shared Address Model Summary

Each processor can name every physical location in the machine
Each process can name all data it shares with other processes
Data transfer via load and store
Data size: byte, word, ..., or cache blocks
Uses virtual memory to map virtual addresses to local or remote physical addresses
The memory hierarchy model applies: communication moves data to the local processor cache (as a load moves data from memory to cache)


Shared Address/Memory Multiprocessor Model

Communicate via load and store
Oldest and most popular model
Process: a virtual address space and ~1 thread of control
Multiple processes can overlap (share), but all threads share the process address space
Writes to the shared address space by one thread are visible to reads by other threads


Advantages of the shared-memory communication model

Compatibility with SMP hardware
Ease of programming:
  • for complex communication patterns; or
  • for dynamic communication patterns
Uses the familiar SMP model
  • attention only on performance-critical accesses
Lower communication overhead
  • better use of bandwidth for small items
  • memory mapping implements protection in hardware
HW-controlled caching
  • reduces remote communication by caching all data, both shared and private


Advantages of the message-passing communication model

The hardware can be simpler
Communication is explicit =>
  • simpler to understand
  • focuses attention on the costly aspect of parallel computation
Synchronization is associated with messages
  • reduces the potential for errors introduced by incorrect synchronization
Easier to implement sender-initiated communication models, which may have some advantages in performance


Programming Models

[Diagram relating programming models to architectures:]
  Data-Parallel Model / Task-Parallel Model
  SIMD: Single Instruction, Multiple Data
  SPMD: Single Program, Multiple Data
  MPMD: Multiple Programs, Multiple Data
  SIMD Architecture / MIMD Architecture


OpenMP (1)

OpenMP gives programmers a “simple” and portable interface for developing shared-memory parallel programs

OpenMP supports C/C++ and Fortran on “all” architectures, including Unix platforms and Windows NT platforms

may become the industry standard


OpenMP (2) - C

#pragma omp parallel for shared(A) private(i)
for( i = 1; i <= 100; i++ ) {
    ... + A[i];
    compute A[i] = ...
}
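For readers who want something compilable, the following is a complete variant of the loop above (a sketch: the array contents and the loop body are placeholder computations, not the talk's example). Build with an OpenMP-capable compiler, e.g. "cc -fopenmp".

/* Complete OpenMP C sketch (assumed example). */
#include <stdio.h>

#define N 100

int main(void) {
    double A[N + 1];
    int i;

    for (i = 0; i <= N; i++)        /* initialize the shared array */
        A[i] = i;

    #pragma omp parallel for shared(A) private(i)
    for (i = 1; i <= N; i++)        /* iterations are split among threads */
        A[i] = A[i] * 2.0 + 1.0;

    printf("A[N] = %g\n", A[N]);
    return 0;
}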


OpenMP (3) - Fortran

c$omp parallel do schedule(static)
c$omp& shared(omega, error, uold, u)
c$omp& private(i, j, resid)
c$omp& reduction(+:error)
      do j = 2, m-1
        do i = 2, n-1
          resid = calcerror(uold, i, j)
          u(i,j) = uold(i,j) - omega * resid
          error = error + resid*resid
        end do
      end do
c$omp end parallel do


Vector Processing (1)

Cray, NEC computers
Multiple functional units, each with multiple stages
Replication and pipelined parallelism at the instruction level


Vector Processing (2)

for( i=0; i<=N; i++ )
    A[i] = B[i] * C[i];

[Diagram: the loop's multiplications are spread across pipelined multiply units Mult1, Mult2, Mult3]


Multi-threading

OS-level multi-threading: P-threads

Programming Language-level multi-threading: Java

Fine Grain Multi-threading: Threaded-C, Cilk, TAM

Hardware Supported Multi-threading: Tera

Instruction Level Multi-threading: Simultaneous Multi-threading (Compaq-Intel)
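A minimal POSIX-threads sketch in C (an assumed example, not from the talk) of the OS-level multi-threading entry above: each thread fills a slice of a shared array, and the main thread joins them. Compile with "-lpthread".

/* POSIX-threads sketch (assumed example). */
#include <pthread.h>
#include <stdio.h>

#define N 8
#define NTHREADS 4

static double a[N];

static void *worker(void *arg) {
    long id = (long)arg;                      /* thread index, passed by value */
    for (long i = id; i < N; i += NTHREADS)   /* cyclic distribution of iterations */
        a[i] = 2.0 * i;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < NTHREADS; id++)
        pthread_join(t[id], NULL);            /* wait for all workers */
    printf("a[N-1] = %g\n", a[N - 1]);
    return 0;
}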


Other Issues: Debugging

Debugging parallel programs can be frustrating

non-deterministic execution
probe effect
difficult to "stop" a parallel program
multiple core files
difficult to visualize parallel activity
tools are barely adequate


Other Issues: Performance Tuning

Use available performance tuning tools (perfex, Speedshop on SGI) to know where the program spends time.

Re-tune code for performance when hardware changes.


Other Issues: Fault Tolerance

Consider a job that has been running on 40 processors for a week when a power outage occurs, losing all the work. Long-running jobs must be able to save the program's state and later restart from that state. This is called checkpointing.



What Does Coherency Mean?

Informally:
  • "Any read must return the most recent write"
  • Too strict and too difficult to implement

Better:
  • "Any write must eventually be seen by a read"
  • All writes are seen in proper order ("serialization")

Two rules to ensure this:
  • If P writes x and P1 reads x, P's write will be seen by P1 if the read and write are sufficiently far apart
  • Writes to a single location are serialized: they are seen in the same order


Potential HW Coherency Solutions

Snooping solution (snoopy bus):
  • Send all requests for data to all processors
  • Each processor snoops to see if it has a copy
  • Requires broadcast
  • Works well with a bus (natural broadcast medium)
  • Preferred scheme for small-scale machines

Directory-based schemes:
  • Keep track of what is being shared in one centralized place
  • Distributed memory => distributed directory
  • Sends point-to-point requests
  • Scales better than snooping


Basic Snoopy Protocols

Write-invalidate protocol:
  Multiple readers, single writer
  Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
  Read miss:
    • Write-through: memory is always up-to-date
    • Write-back: snoop in caches to find the most recent copy

Write-broadcast protocol:
  Write to shared data: broadcast on the bus; processors snoop and update any copies
  Read miss: memory is always up-to-date

Write serialization: the bus serializes requests!
  The bus is the single point of arbitration
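To make the write-invalidate idea concrete, here is a small C sketch (an assumed MSI-style model, not the talk's protocol) of the per-block cache states and the transitions triggered by local accesses and by snooped bus requests.

/* Write-invalidate (MSI-style) state sketch (assumed example). */
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } State;

/* Transition taken by the local cache on a processor read or write. */
static State on_cpu_access(State s, int is_write) {
    if (is_write)
        return MODIFIED;                     /* broadcast invalidate, own the block */
    return (s == INVALID) ? SHARED : s;      /* read miss loads the block as shared */
}

/* Transition taken when the cache snoops another processor's bus request. */
static State on_bus_snoop(State s, int other_is_write) {
    if (other_is_write)
        return INVALID;                      /* another writer invalidates our copy */
    if (s == MODIFIED)
        return SHARED;                       /* supply the dirty block, downgrade */
    return s;
}

int main(void) {
    State s = INVALID;
    s = on_cpu_access(s, 0);                 /* local read   -> SHARED   */
    s = on_cpu_access(s, 1);                 /* local write  -> MODIFIED */
    s = on_bus_snoop(s, 1);                  /* remote write -> INVALID  */
    printf("final state: %d\n", s);
    return 0;
}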


Basic Snoopy Protocols

Write invalidate versus write broadcast:
  Invalidate requires one transaction per write-run
  Invalidate uses spatial locality: one transaction per block
  Broadcast has lower latency between write and read


Snoopy, Cache-Invalidation Protocol (Example)

[Figure sequence over a bus-based system with Processor0 ... ProcessorN-1, Main Memory, and an I/O System:
 1. A processor reads x: the read miss fetches the block from main memory and caches it in the shared state.
 2. A second processor reads x: another read miss; both caches now hold x in the shared state.
 3. One of the processors writes x: an invalidate is broadcast, the other copies are invalidated, and the writer's copy becomes exclusive.]


Programmer's Abstraction for a Sequential Consistency Model

[Diagram: processors P1 ... Pn connected through a single switch to one Memory]

The switch is randomly set after each memory reference.

(See Culler, Singh, and Gupta, p. 287)



Top 500 (November 10, 2001)

Manuf.    Computer                        Rmax    Site                        Year  Proc.  Rpeak  Rmax/Rpeak
IBM       ASCI White, Power3, 375 MHz     7226    Lawrence Livermore          2000  8192   12288  0.59
Compaq    AlphaServer SC ES45/1 GHz       4059    Pittsburgh Superc. Center   2001  6048   6048   0.67
IBM       SP Power3, 375 MHz 16-way       3052    NERSC/LBNL                  2001  3328   4992   0.61
Intel     ASCI Red                        2379    Sandia Nat. Lab.            1999  9632   3207   0.74
IBM       ASCI Blue-Pacific, SP 604e      2144    Lawrence Livermore          1999  5808   3868   0.55
Compaq    AlphaServer SC ES45/1 GHz       2096    Los Alamos Nat. Lab.        2001  1536   3072   0.68
Hitachi   SR8000/MPP                      1709.1  Univ. of Tokyo              2001  1152   2074   0.82
...
NEC (12)  SX-5/128M8 3.2ns                1192    Osaka University            2001  128    1280   0.93

Rmax = maximal LINPACK performance achieved [GFLOPS]
Rpeak = theoretical peak performance [GFLOPS]

Canada appears in ranks 123, 144, 183, 255, 266, 280, 308, 311, 315, 414, 419.


Top500 Statistics

[Four charts (legend categories only):
 Computer Family: HP SPP, IBM SP, SGI Origin, T3E/T3D, NOW, Sun UltraHPC, Others
 Type of Organization: Industry, Research, Academic, Classified, Vendor, Government
 Machine Organization: MPP, Constellations, SMP, Clusters
 Country: USA, Germany, Japan, UK, France, Korea, Italy, Canada, Others]



Intel Architecture 64

Itanium, the first one, is out….

… but we are still waiting for McKinley...

Will we get Yamhill (Intel’s Plan B) instead?


Alpha is gone...

So is Compaq! EV8 is scrapped.

The Compaq designers split between AMD and Intel.

Intel converts to the Simultaneous Multithreading religion, but renames it: Hyperthreading!!


IBM’s POWER4 is the 2001/2002 winner

Best floating-point and integer performance available on the market.

Highest memory bandwidth in the industry.

Well-integrated cache-coherence mechanism.


Implicit vs. Explicit Instruction-Level Parallelism

EPIC: Explicitly Parallel Instruction Computing

Superscalar: instruction-level parallelism discovered implicitly by the compiler or by the hardware.


Instruction Level Parallelism

Most ILP is implicit, i.e., instructions that can be executed in parallel are automatically discovered by the hardware at runtime.

Intel launched Itanium, the first IA-64 processor, which exploits explicit parallelism at the instruction level: in this processor the compiler encodes parallel operations in the assembly code.

But applications are still written in standard sequential programming languages.


Instruction Level Parallelism

For example, in the IA-64 an instruction group, identified by the compiler, is a set of instructions that have no read-after-write (RAW) or write-after-write (WAW) register dependencies (they can execute in parallel).

Consecutive instruction groups are separated by stops (represented by a double semicolon in the assembly code).

ld8  r1 = [r5]      // First group
sub  r6 = r8, r9    // First group
add  r3 = r1, r4 ;; // First group
st8  [r6] = r12     // Second group


IA-64 Innovations

if-conversion: execute both sides of a branch.

    if (r1 == 0)
        r2 = r3 + r4
    else
        r7 = r6 - r5

    cmp.ne p1, p2 = r1, 0 ;;   // set predicate registers
    (p1) add r2 = r3, r4
    (p2) sub r7 = r6, r5

data speculation: load a value before knowing whether the address is correct.

control speculation: execute a computation before knowing whether it has to.

rotating registers: support for software pipelining.



Below/Above the Line

for(n=0 ; …)
 for(f=0 ; …)
  for(t=0 ; …)
   for(x=0 ; …)
    for(y=0 ; …)
     for(z=0 ; …) {
       ……..
     }

[Diagram: a line divides the loop nest between Application-Level Parallelism and Automatic Parallelism]


Some Common Loop Optimizations

Unswitching
Loop Peeling
Loop Alignment
Index Set Splitting
Scalar Expansion
Loop Fusion
Loop Fission
Loop Reversal
Loop Interchange


Unswitching

Remove loop independent conditionals from a loop.

Before unswitching:

for i=1 to N do
  for j=2 to N do
    if T[i] > 0 then
      A[i,j] = A[i,j-1]*T[i] + B[i]
    else
      A[i,j] = 0.0
    endif
  endfor
endfor

After unswitching:

for i=1 to N do
  if T[i] > 0 then
    for j=2 to N do
      A[i,j] = A[i,j-1]*T[i] + B[i]
    endfor
  else
    for j=2 to N do
      A[i,j] = 0.0
    endfor
  endif
endfor


Loop Peeling

Remove the first (last) iteration of the loop into separate code.

Before peeling:

for i=1 to N do
  A[i] = (X+Y)*B[i]
endfor

After peeling:

if N >= 1 then
  A[1] = (X+Y)*B[1]
  for i=2 to N do
    A[i] = (X+Y)*B[i]
  endfor
endif


Index Set Splitting

Divides the index set into two portions.

Before index set splitting:

for i=1 to 100 do
  A[i] = B[i] + C[i]
  if i > 10 then
    D[i] = A[i] + A[i-10]
  endif
endfor

After index set splitting:

for i=1 to 10 do
  A[i] = B[i] + C[i]
endfor
for i=11 to 100 do
  A[i] = B[i] + C[i]
  D[i] = A[i] + A[i-10]
endfor


Scalar Expansion

Breaks anti-dependence relations by expanding, or promoting a scalar into an array.

Before scalar expansion:

for i=1 to N do
  T = A[i] + B[i]
  C[i] = T + 1/T
endfor

After scalar expansion:

if N >= 1 then
  allocate Tx(1:N)
  for i=1 to N do
    Tx[i] = A[i] + B[i]
    C[i] = Tx[i] + 1/Tx[i]
  endfor
  T = Tx[N]
endif


Loop Fusion

Takes two adjacent loops and generates a single loop.

Before loop fusion:

(1) for i=1 to N do
(2)   A[i] = B[i] + 1
(3) endfor
(4) for i=1 to N do
(5)   C[i] = A[i] / 2
(6) endfor
(7) for i=1 to N do
(8)   D[i] = 1 / C[i+1]
(9) endfor

After loop fusion:

(1) for i=1 to N do
(2)   A[i] = B[i] + 1
(5)   C[i] = A[i] / 2
(6) endfor
(7) for i=1 to N do
(8)   D[i] = 1 / C[i+1]
(9) endfor


Loop Fusion (Another Example)

(1) for i=1 to 99 do
(2)   A[i] = B[i] + 1
(3) endfor
(4) for i=1 to 98 do
(5)   C[i] = A[i+1] * 2
(6) endfor

(2) A[1] = B[1] + 1
(1) for i=2 to 99 do
(2)   A[i] = B[i] + 1
(3) endfor
(4) for i=1 to 98 do
(5)   C[i] = A[i+1] * 2
(6) endfor

(1) i = 1
(2) A[i] = B[i] + 1
    for ib=0 to 97 do
(1)   i = ib+2
(2)   A[i] = B[i] + 1
(4)   i = ib+1
(5)   C[i] = A[i+1] * 2
(6) endfor


Loop Fission

Breaks a loop into two or more smaller loops.

Original loop:

(1) for i=1 to N do
(2)   A[i] = A[i] + B[i-1]
(3)   B[i] = C[i-1]*X + Z
(4)   C[i] = 1/B[i]
(5)   D[i] = sqrt(C[i])
(6) endfor

After loop fission:

(1) for ib=0 to N-1 do
(3)   B[ib+1] = C[ib]*X + Z
(4)   C[ib+1] = 1/B[ib+1]
(6) endfor
(1) for ib=0 to N-1 do
(2)   A[ib+1] = A[ib+1] + B[ib]
(6) endfor
(1) for ib=0 to N-1 do
(5)   D[ib+1] = sqrt(C[ib+1])
(6) endfor
(1) i = N+1


Loop Reversal

Runs a loop backward; all dependence directions are reversed.

It is only legal for loops that have no loop-carried dependences.

Reversal can be used to enable fusion:

(1) for i=1 to N do
(2)   A[i] = B[i] + 1
(3)   C[i] = A[i]/2
(4) endfor
(5) for i=1 to N do
(6)   D[i] = 1/C[i+1]
(7) endfor

(1) for i=N downto 1 do
(2)   A[i] = B[i] + 1
(3)   C[i] = A[i]/2
(4) endfor
(5) for i=N downto 1 do
(6)   D[i] = 1/C[i+1]
(7) endfor

(1) for i=N downto 1 do
(2)   A[i] = B[i] + 1
(3)   C[i] = A[i]/2
(6)   D[i] = 1/C[i+1]
(7) endfor


Loop Interchanging

Before interchange:

(1) for j=2 to M do
(2)   for i=1 to N do
(3)     A[i,j] = A[i,j-1] + B[i,j]
(4)   endfor
(5) endfor

After interchange:

(1) for i=1 to N do
(2)   for j=2 to M do
(3)     A[i,j] = A[i,j-1] + B[i,j]
(4)   endfor
(5) endfor



Speedup

goal is to use N processors to make a program run N times faster

speedup is the factor by which the program’s speed improves

Speedup(p processors) = Performance(p processors) / Performance(1 processor)

Speedup(p processors) = Time(1 processor) / Time(p processors)


Absolute vs. Relative Speedup

Careful: the execution time depends on what the program does!

A parallel program spends time in:
  Work
  Synchronization
  Communication
  Extra work

A program implemented for a parallel machine is likely to do extra work (compared to a sequential program) even when running on a single-processor machine!


Absolute vs. Relative Speedup

When talking about execution time, ask what algorithm is implemented!

Relative Speedup(p processors) = Time(Parallel Alg., 1 processor) / Time(Parallel Alg., p processors)

Absolute Speedup(p processors) = Time(Sequential Alg., 1 processor) / Time(Parallel Alg., p processors)
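A quick illustration with assumed numbers: if the parallel algorithm takes 100 s on 1 processor and 10 s on 16 processors, the relative speedup is 100/10 = 10; if the best sequential algorithm solves the same problem in 80 s, the absolute speedup is only 80/10 = 8.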


Speedup


Which is Better?

Programs A & B solve the same problem using different algorithms.
Both are run on a 100-processor computer.
Program A gets a 90-fold speedup.
Program B gets a 10-fold speedup.
Which one would you prefer to use?


It Depends!

All that matters is overall execution time.
What if A runs sequentially 1,000 times slower than B?
Always use the best sequential time (over all algorithms) for computing speedups!
And the best compiler!


Superlinear Speedups

sometimes N processors can achieve a speedup > N

usually the result of improving an inferior sequential algorithm

can legitimately occur because of cache and memory effects


Amdahl’s Law (1)

seq: sequential portion of a program
par: parallel portion of a program
N:   number of processors

Speedup = (T_seq + T_par) / (T_seq + T_par / N) = 1 / (seq + par / N)

The maximum speedup that can be obtained is:

MaxSpeedup = (seq + par) / seq = 1.0 / seq
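A worked example with an assumed sequential fraction: if seq = 0.05 (5% of the work is inherently sequential) and par = 0.95, then on N = 100 processors the speedup is 1 / (0.05 + 0.95/100) ≈ 16.8, while the maximum speedup, no matter how many processors are added, is 1 / 0.05 = 20.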


Amdahl’s Law (2)

[Graph: speedup vs. N for several values of the sequential fraction (% seq time)]


Scalability

desirable property of algorithm is scalability, regardless of speedup

problem of size P using N processors takes time T

problem is scalable if problem of size 2P on 2N processors still takes time T


Choose Right Algorithm

understand strengths and weaknesses of hardware being used

choose an algorithm to exploit strengths and avoid weaknesses

Example: there are many parallel sorting algorithms, each valid for different hardware/application properties



The Reality Is ...

software – more issues to be addressed beyond sequential case

hardware – need to understand machine architecture

Parallel programming is hard!

The Reward Is ...
high performance – the motivation in the first place!


Do You Need Parallelism?

Consider the trade-offs:
  + Potentially faster execution turnaround
  - Longer software development time
  - Obtaining machine access
  - Cost = f(development) + g(execution time)

Do the benefits outweigh the costs?


Resistance to Parallelism

software inertia – cost of converting existing software

hardware inertia – waiting for faster machines
lack of education
limited access to resources


Starting Out...

parallel program design is black magic
software tools are primitive
experience is an asset

All is not lost!
Many problems are amenable to simple parallel solutions.


Starting Out...

sequential world is simple: one architectural model

in parallel world, need to choose the algorithm to suit the architecture

parallel algorithm may only perform well on one class of machine


Granularity (1)

Granularity is the relation between the amount of computation and the amount of communication.

It is a measure of how much work gets done before processes have to communicate.


Granularity (2)

Problem: shopping for 100 items of grocery.

Scenario 1 (small granularity):
You are told an item to buy, go to the store, purchase the item, then return home to find out the next item to buy.

Scenario 2 (large granularity):
You are told all 100 items, and then you make a single trip to purchase everything.


Granularity (3)

As in the real world, communication and synchronization take time.


Architectures

Match problem granularity to the parallel architecture:

  fine-grained          → vector/array processors
  medium-grained        → shared-memory multiprocessor
  coarse/large-grained  → network of workstations


Program Design

1) Identify the hardware platforms available
2) Identify parallelism in the application
3) Choose the right type of algorithmic parallelism to match the architecture
4) Implement the algorithm, being wary of performance and correctness issues


Vector Processing (3)

this class of machines is very effective at striding through arrays

some parallelism can be automatically detected by compiler

there is a "right" way and a "wrong" way to code loops to allow parallelism (see the sketch below)
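The sketch below (an assumed example, not from the talk) contrasts the two cases: a loop with independent iterations that strides through arrays and vectorizes well, versus a loop with a carried dependence that cannot be vectorized as written.

/* Vectorization-friendly vs. vectorization-hostile loops (assumed example). */
#include <stdio.h>

#define N 1024
static double a[N], b[N], c[N];

static void right_way(void) {
    for (int i = 0; i < N; i++)      /* independent iterations: vectorizes well */
        a[i] = b[i] * c[i];
}

static void wrong_way(void) {
    for (int i = 1; i < N; i++)      /* a[i] needs a[i-1]: loop-carried dependence, */
        a[i] = a[i - 1] * b[i];      /* so this loop cannot be vectorized as written */
}

int main(void) {
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0; }
    right_way();
    wrong_way();
    printf("a[N-1] = %g\n", a[N - 1]);
    return 0;
}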


Distributed Memory (1)

Loosely coupled processors
IBM SP-2, networks of computers
PCs connected by a fast network


Distributed Memory (2)

Communication between processes by sending messages.

Overhead of parallelism includes:
  cost of preparing a message
  cost of sending a message
  cost of waiting for a message


Communication

distributed memory (loosely coupled):
explicitly send messages between processes


Synchronization


Message Passing (1)

Process 1                   Process 2
compute                     compute
Send( P2, info );           compute
compute                     Receive( P1, info );
idle                        compute
idle                        Send( P1, reply );
Receive( P2, reply );

[The matching Send/Receive pairs both synchronize and communicate.]


Message Passing (2)

Two popular message-passing libraries:
  PVM (Parallel Virtual Machine)
  MPI (Message Passing Interface)
MPI will likely be the industry standard.
Both are easy to use, but verbose.


Master/Slave (1)

many distributed memory programs are structured as one master and N slaves

The master generates work to do.
When idle, a slave asks for a piece of work, gets it, does the task, and reports the result.


Master/Slave (2)

Master:
    worklist = Make( work );
    while( worklist != empty )
    {
        Receive( slave, result );
        to_do = Head( worklist );
        Send( slave, to_do );
        Process( result );
    }

Slave:
    data = NEED_WORK;
    while( true )
    {
        Send( master, data );
        /* wait */
        Receive( master, work );
        data = Process( work );
    }

[Diagram: one master (M) connected to several slaves (S)]


Pitfall: Deadlock

scenario by which no further progress can be made, since each process is waiting on a cyclic dependency


Pitfall: Deadlock

A real-world problem!


Pitfall: Load Balancing

need to assign a roughly equal portion of work to each processor

load imbalances can result in poor speedups


Shared Memory (1)

Tightly coupled multiprocessors
SGI Origin 2400, Sun E10000
Classified by memory access times:
  Same (SMP - symmetric multiprocessor)
  Different (NUMA - non-uniform memory access)

[Diagram: four processors (P) sharing one Memory]


Shared Memory (2)

Communicate between processes through reading/writing shared variables (instead of sending and receiving messages).

Overhead of parallelism includes:
  contention for shared resources
  protecting the integrity of shared resources


Communication

shared memory (tightly coupled):
read from and write to shared data


Synchronization

May have to prevent simultaneous access!

Process 1                  Process 2
B = BankBalance;           B = BankBalance;
B = B + 100;               B = B + 150;
BankBalance = B;           BankBalance = B;
Print Statement;           Print Statement;

What is the value of BankBalance?


Pitfall: Shared Data Access

Need to restrict access to data!
Avoid race conditions!

Process 1                  Process 2
Lock( access );            Lock( access );
B = BankBalance;           B = BankBalance;
B = B + 100;               B = B + 150;
BankBalance = B;           BankBalance = B;
Unlock( access );          Unlock( access );
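The same discipline expressed with a real lock: a minimal POSIX-mutex sketch in C (an assumed example, not the talk's code, with the balance assumed to start at 0) in which the lock makes the read-modify-write of the shared balance atomic, so the final balance is always 250.

/* Mutex-protected shared update (assumed example). */
#include <pthread.h>
#include <stdio.h>

static double BankBalance = 0.0;
static pthread_mutex_t access_lock = PTHREAD_MUTEX_INITIALIZER;

static void *deposit(void *arg) {
    double amount = *(double *)arg;
    pthread_mutex_lock(&access_lock);        /* Lock( access )   */
    double b = BankBalance;
    b = b + amount;
    BankBalance = b;
    pthread_mutex_unlock(&access_lock);      /* Unlock( access ) */
    return NULL;
}

int main(void) {
    pthread_t p1, p2;
    double d1 = 100.0, d2 = 150.0;
    pthread_create(&p1, NULL, deposit, &d1);
    pthread_create(&p2, NULL, deposit, &d2);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    printf("BankBalance = %.2f\n", BankBalance);   /* always 250.00 */
    return 0;
}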


Multi-threading


Simultaneous Multi-threading

http://www.eet.com/story/0EG19991008S0014 by Rick Merrit

Top 500

[Charts: http://www.top500.org]


Conclusions

some problems require extensive computational resources

parallelism allows you to decrease experiment turnaround time

the tools are adequate but still have a long way to go

“simple” parallelism gets some performance but maximum performance requires effort.

Performance commensurate with effort!


Reminders

understand your computational needs

understand the hardware and software resources available to you

match parallelism to the architecture

maximize utilization, don’t waste cycles!

granularity, granularity, granularity

develop, test and debug small data sets before trying large ones

be wary of the many pitfalls


We Want You!

For help with parallel programming, contact

[email protected]

Get parallel! Become part of MACI.