Parallel Programming Sathish S. Vadhiyar


Page 1:

Parallel Programming

Sathish S. Vadhiyar

Page 2:

2

Motivations of Parallel Computing

Parallel Machine: a computer system with more than one processor

Motivations

• Faster execution time, by exploiting non-dependencies between regions of code
• Presents a level of modularity
• Resource constraints, e.g. large databases
• Certain classes of algorithms lend themselves naturally to parallelism
• Aggregate bandwidth to memory/disk; increase in data throughput
• Clock rate improvement in the past decade – about 40%
• Memory access time improvement in the past decade – about 10%

Page 3:

3

Parallel Programming and Challenges

Recall the advantages and motivation of parallelism

But parallel programs incur overheads not seen in sequential programs:
• Communication delay
• Idling
• Synchronization

Page 4:

4

Challenges

[Figure: execution timelines for two processes P0 and P1, showing intervals of computation, communication, synchronization, and idle time.]

Page 5:

5

How do we evaluate a parallel program?

Execution time, Tp

Speedup, S
  S(p, n) = T(1, n) / T(p, n)
  Usually S(p, n) < p; sometimes S(p, n) > p (superlinear speedup)

Efficiency, E
  E(p, n) = S(p, n) / p
  Usually E(p, n) < 1; sometimes greater than 1

Scalability – limitations in parallel computing, and their relation to n and p
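As a small illustration (a minimal sketch, not from the slides), the metrics above can be computed from measured wall-clock times; the timing uses MPI_Wtime, and T1 is assumed to come from a separate single-process run:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        /* ... parallel computation on a problem of size n ... */
        double Tp = MPI_Wtime() - t0;

        double T1 = 10.0;        /* assumed: measured time of the sequential run */
        double S  = T1 / Tp;     /* speedup    S(p, n) = T(1, n) / T(p, n) */
        double E  = S / p;       /* efficiency E(p, n) = S(p, n) / p       */

        if (rank == 0)
            printf("p=%d  Tp=%.3f  S=%.2f  E=%.2f\n", p, Tp, S, E);
        MPI_Finalize();
        return 0;
    }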

Page 6:

6

Speedups and efficiency

[Figure: speedup S versus number of processors p, with the ideal (linear) curve above the practical curve; and efficiency E versus p.]

Page 7:

7

Limitations on speedup – Amdahl’s law

Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

It expresses the overall speedup in terms of the fractions of computation time with and without the enhancement, and the improvement obtained from the enhancement; this places a limit on the speedup due to parallelism.

With serial fraction fs and parallel fraction fp executed on P processors:

  Speedup = 1 / (fs + fp/P)
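A quick worked instance (not from the slides): with fs = 0.05 and fp = 0.95 on P = 16 processors, Speedup = 1 / (0.05 + 0.95/16) ≈ 9.1, well short of the ideal 16; even as P grows without bound, the speedup cannot exceed 1/fs = 20.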

Page 8:

Amdahl’s law Illustration

Courtesy:

http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html

http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm

S = 1 / (s + (1-s)/p)

[Figure: speedup and efficiency plotted against the number of processors p for different values of the serial fraction s.]

Page 9:

Amdahl’s law analysis

  f      P=1    P=4    P=8    P=16    P=32
  1.00   1.0    4.00   8.00   16.00   32.00
  0.99   1.0    3.88   7.48   13.91   24.43
  0.98   1.0    3.77   7.02   12.31   19.75
  0.96   1.0    3.57   6.25   10.00   14.29

• For the same parallel fraction f, the speedup falls further and further behind the processor count as P grows.
• Thus Amdahl’s law is a bit depressing for parallel programming.
• In practice, the amount of parallel work has to be large enough to match a given number of processors.
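The table can be reproduced directly from the formula; the short program below (a sketch, not part of the original slides) prints S = 1 / ((1-f) + f/P) for the same values of f and P:

    #include <stdio.h>

    int main(void) {
        double fracs[] = {1.00, 0.99, 0.98, 0.96};   /* parallel fraction f */
        int    procs[] = {1, 4, 8, 16, 32};          /* processor counts P  */
        for (int i = 0; i < 4; i++) {
            printf("f=%.2f:", fracs[i]);
            for (int j = 0; j < 5; j++) {
                double s = 1.0 / ((1.0 - fracs[i]) + fracs[i] / procs[j]);
                printf("  %6.2f", s);
            }
            printf("\n");
        }
        return 0;
    }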

Page 10:

Gustafson’s Law

Amdahl’s law keeps the problem (the parallel work) fixed; Gustafson’s law keeps the computation time on the parallel processors fixed and changes the problem size (the fraction of parallel vs. sequential work) to match that computation time.

For a particular number of processors, find the problem size for which the parallel time equals the constant time.

For that problem size, find the sequential time and the corresponding speedup.

The speedup obtained this way is called scaled speedup, also known as weak speedup.
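In formula form (a standard statement of the law, not on the slide): if s is the serial fraction of the execution time on the parallel machine, the scaled speedup is

  Scaled speedup = s + (1 - s) * P

which grows linearly with P instead of saturating at 1/s as in Amdahl’s law.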

Page 11:

11

Metrics (Contd..)

  N     P=1    P=4    P=8    P=16   P=32
  64    1.0    0.80   0.57   0.33
  192   1.0    0.92   0.80   0.60
  512   1.0    0.97   0.91   0.80

Table 5.1: Efficiency as a function of n and p.
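These entries appear to correspond to the textbook example (Grama et al.) of adding n numbers on a p-processor hypercube; assuming that model, Tp ≈ n/p + 2 log2 p, so E = 1 / (1 + 2p log2(p) / n). For instance, with n = 64 and p = 16, E = 1 / (1 + 2*16*4/64) = 1/3 ≈ 0.33, matching the table.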

Page 12:

12

Scalability

Efficiency decreases with increasing P; it increases with increasing N.

How effectively the parallel algorithm can use an increasing number of processors.

How the amount of computation performed must scale with P to keep E constant: this function of computation in terms of P is called the isoefficiency function.

An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable.
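As a concrete instance (standard textbook material, not on the slide): for adding n numbers on p processors, E = 1 / (1 + 2p log2(p) / n), so keeping E constant requires the problem size n to grow in proportion to p log p. That growth requirement, Θ(p log p), is the isoefficiency function of the algorithm and indicates good scalability.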

Page 13:

13

Parallel Program Models

Single Program Multiple Data (SPMD)

Multiple Program Multiple Data (MPMD)

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Page 14:

14

Programming Paradigms

Shared memory model – Threads, OpenMP, CUDA

Message passing model – MPI
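For illustration only (a minimal sketch, not from the slides), a message passing SPMD program in C using MPI: every process runs the same executable and branches on its rank (run with at least two processes).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            int msg = 42;
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send to rank 1 */
        } else if (rank == 1) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 of %d received %d from rank 0\n", size, msg);
        }
        MPI_Finalize();
        return 0;
    }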

Page 15:

PARALLELIZATION

Page 16:

16

Parallelizing a Program

Given a sequential program/algorithm, how to go about producing a parallel version.

Four steps in program parallelization:

1. Decomposition
   Identifying parallel tasks with a large extent of possible concurrent activity; splitting the problem into tasks

2. Assignment
   Grouping the tasks into processes with the best load balancing

3. Orchestration
   Reducing synchronization and communication costs

4. Mapping
   Mapping of processes to processors (if possible)

Page 17:

17

Steps in Creating a Parallel Program

[Figure: the sequential computation is decomposed into tasks, the tasks are assigned to processes p0–p3 (partitioning = decomposition + assignment), orchestration turns these into a parallel program, and mapping places the processes onto processors P0–P3.]

Page 18:

18

Orchestration

Goals
  Structuring communication
  Synchronization

Challenges
  Organizing data structures – packing
  Small or large messages?
  How to organize communication and synchronization?

Page 19:

19

Orchestration

Maximizing data locality
  Minimizing the volume of data exchange
  Not communicating intermediate results – e.g. dot product

Minimizing the frequency of interactions – packing

Minimizing contention and hot spots
  Avoid having all processes use the same communication pattern (e.g., all communicating with the same process) at the same time

Overlapping computations with interactions (see the sketch below)
  Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2)
  Initiate communication for type 1; during the communication, perform type 2

Replicating data or computations
  Balancing the extra computation or storage cost against the gain from reduced communication
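A minimal sketch of the overlap idea using nonblocking MPI calls (illustrative only; the ring exchange and the dummy computations are assumptions): start the communication, do the work that does not need the incoming data (type 2), then wait and finish the dependent work (type 1).

    #include <mpi.h>
    #include <stdio.h>
    #define N 4

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double halo_out[N], halo_in[N], local = 0.0;
        for (int i = 0; i < N; i++) halo_out[i] = rank + i;

        int right = (rank + 1) % size, left = (rank - 1 + size) % size;
        MPI_Request req[2];
        MPI_Irecv(halo_in,  N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(halo_out, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

        for (int i = 0; i < 1000; i++)      /* type 2: independent of the halo */
            local += i * 1e-6;

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

        for (int i = 0; i < N; i++)         /* type 1: needs the received data */
            local += halo_in[i];

        printf("rank %d: result %f\n", rank, local);
        MPI_Finalize();
        return 0;
    }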

Page 20:

20

Mapping

Which process runs on which particular processor?
  Can depend on the network topology and the communication pattern of the processes
  Can also depend on processor speeds in the case of heterogeneous systems

Page 21:

21

Mapping

Static mapping
  Mapping based on data partitioning
    Applicable to dense matrix computations
    Block distribution:         0 0 0 1 1 1 2 2 2
    Block-cyclic distribution:  0 1 2 0 1 2 0 1 2
  Graph-partitioning based mapping
    Applicable to sparse matrix computations
  Mapping based on task partitioning
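A small sketch (not from the slides) of how the two data-partitioning schemes above assign element i of an n-element array to one of p processes; b is the block size for the block-cyclic case:

    #include <stdio.h>

    /* Owner of element i under the two distributions (assumes p divides n). */
    int block_owner(int i, int n, int p)        { return i / (n / p); }
    int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }

    int main(void) {
        int n = 9, p = 3, b = 1;
        for (int i = 0; i < n; i++) printf("%d ", block_owner(i, n, p));
        printf("  (block)\n");
        for (int i = 0; i < n; i++) printf("%d ", block_cyclic_owner(i, b, p));
        printf("  (block-cyclic, block size 1)\n");
        return 0;
    }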

Page 22:

22

Based on Task Partitioning

Based on the task dependency graph

In general the mapping problem is NP-complete

[Figure: a binary-tree task dependency graph over eight leaf tasks 0–7, with nodes assigned to processes level by level: 0; 0, 4; 0, 2, 4, 6; 0, 1, 2, 3, 4, 5, 6, 7.]

Page 23:

23

Mapping

Dynamic Mapping
  A process/global memory can hold a set of tasks
  Distribute some tasks to all processes
  Once a process completes its tasks, it asks the coordinator process for more tasks
  Referred to as self-scheduling, work-stealing
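A minimal self-scheduling sketch in MPI (illustrative only; do_task is a placeholder): rank 0 acts as the coordinator handing out task indices, and each worker asks for a new task whenever it finishes one.

    #include <mpi.h>
    #define NTASKS 100
    #define TAG_WORK 1
    #define TAG_STOP 2

    static void do_task(int t) { (void)t; /* placeholder for real work */ }

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                       /* coordinator */
            int next = 0, stopped = 0, dummy;
            MPI_Status st;
            while (stopped < size - 1) {
                MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                    next++;
                } else {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                    stopped++;
                }
            }
        } else {                               /* worker */
            int task, req = 0;
            MPI_Status st;
            for (;;) {
                MPI_Send(&req, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* ask for work */
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                do_task(task);
            }
        }
        MPI_Finalize();
        return 0;
    }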

Page 24:

24

High-level Goals

Table 2.1: Steps in the Parallelization Process and Their Goals

Step            Architecture-Dependent?   Major Performance Goals
Decomposition   Mostly no                 Expose enough concurrency, but not too much
Assignment      Mostly no                 Balance workload; reduce communication volume
Orchestration   Yes                       Reduce noninherent communication via data locality;
                                          reduce communication and synchronization cost as
                                          seen by the processor; reduce serialization at
                                          shared resources; schedule tasks to satisfy
                                          dependences early
Mapping         Yes                       Put related processes on the same processor if
                                          necessary; exploit locality in network topology

Page 25:

PARALLEL ARCHITECTURE

25

Page 26:

26

Classification of Architectures – Flynn’s classification

In terms of parallelism in instruction and data stream

Single Instruction Single Data (SISD): Serial Computers

Single Instruction Multiple Data (SIMD)

- Vector processors and processor arrays

- Examples: CM-2, Cray C90, Cray Y-MP, Hitachi 3600

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Page 27:

27

Classification of Architectures – Flynn’s classification

Multiple Instruction Single Data (MISD): Not popular

Multiple Instruction Multiple Data (MIMD)

- Most popular – IBM SP and most other supercomputers, clusters, computational Grids, etc.

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Page 28:

28

Classification of Architectures – Based on Memory

Shared memory – 2 types: UMA and NUMA

[Figures: UMA and NUMA shared-memory organizations.]

Examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Page 29:

29

Classification 2: Shared Memory vs Message Passing

Shared memory machine: the n processors share a physical address space
  Communication can be done through this shared memory

The alternative is sometimes referred to as a message passing machine or a distributed memory machine

[Figures: shared memory – processors connected by an interconnect to a single main memory; message passing – each processor paired with its own memory, with the processor–memory nodes connected by an interconnect.]

Page 30:

30

Shared Memory Machines

The shared memory could itself be distributed among the processor nodes
  Each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time

Terms: NUMA vs UMA architecture
  NUMA: Non-Uniform Memory Access
  UMA: Uniform Memory Access

Page 31:

31

Classification of Architectures – Based on Memory

Distributed memory

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Recently: multi-cores

Yet another classification – MPPs, NOW (Berkeley), COW, computational Grids

Page 32:

32

Parallel Architecture: Interconnection Networks

An interconnection network is defined by switches, links and interfaces
  Switches – provide mapping between input and output ports, buffering, routing, etc.
  Interfaces – connect nodes with the network

Page 33:

33

Parallel Architecture: Interconnections

Indirect interconnects: nodes are connected to an interconnection medium, not directly to each other
  Shared bus, multiple bus, crossbar, MIN (multistage interconnection network)

Direct interconnects: nodes are connected directly to each other
  Topology: linear, ring, star, mesh, torus, hypercube
  Routing techniques: how the route taken by a message from source to destination is decided

Network topologies
  Static – point-to-point communication links among processing nodes
  Dynamic – communication links are formed dynamically by switches

Page 34:

34

Interconnection Networks

Static
  Bus – SGI Challenge
  Completely connected
  Star
  Linear array, ring (1-D torus)
  Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
  k-d mesh: d dimensions with k nodes in each dimension
  Hypercubes – a 2 x 2 x ... x 2 (log p-dimensional) mesh – e.g. many MIMD machines
  Trees – our campus network

Dynamic – communication links are formed dynamically by switches
  Crossbar – Cray X series – non-blocking network
  Multistage – SP2 – blocking network

For more details, and an evaluation of topologies, refer to the book by Grama et al.

Page 35:

35

Indirect Interconnects

[Figures: shared bus, multiple bus, crossbar switch, and a multistage interconnection network built from 2x2 crossbars.]

Page 36:

36

Direct Interconnect Topologies

[Figures: linear array, ring, star, 2-D mesh, torus, and hypercube (binary n-cube) for n = 2 and n = 3.]

Page 37:

37

Evaluating Interconnection topologies

Diameter – maximum distance between any two processing nodes
  Fully connected – 1
  Star – 2
  Ring – p/2
  Hypercube – log p

Connectivity – multiplicity of paths between 2 nodes; the minimum number of arcs to be removed from the network to break it into two disconnected networks
  Linear array – 1
  Ring – 2
  2-D mesh – 2
  2-D mesh with wraparound – 4
  d-dimensional hypercube – d

Page 38:

38

Evaluating Interconnection topologies

Bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves
  Ring – 2
  P-node 2-D mesh – sqrt(P)
  Tree – 1
  Star – 1
  Completely connected – P^2/4
  Hypercube – P/2

Page 39:

39

Evaluating Interconnection topologies

Channel width – number of bits that can be simultaneously communicated over a link, i.e. the number of physical wires between 2 nodes

Channel rate – performance of a single physical wire

Channel bandwidth – channel rate times channel width

Bisection bandwidth – maximum volume of communication between the two halves of the network, i.e. bisection width times channel bandwidth
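A small worked example with assumed numbers (not from the slides): if each link is 32 wires wide (channel width) and each wire carries 1 Gbit/s (channel rate), the channel bandwidth is 32 Gbit/s; a 16-node hypercube has bisection width 16/2 = 8 links, so its bisection bandwidth is 8 x 32 = 256 Gbit/s.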

Page 40:

40

Shared Memory Architecture: Caches

[Figure: processors P1 and P2 both read X and cache the value X: 0; P1 then writes X = 1, updating its cache and memory, but a later read of X on P2 hits in P2's cache and returns the stale value 0 – "Cache hit: Wrong data!!"]

Page 41:

41

Cache Coherence Problem

If each processor in a shared memory multiprocessor has a data cache
  Potential data consistency problem: the cache coherence problem
  Shared variable modification with private caches

Objective: processes shouldn't read 'stale' data

Solutions
  Hardware: cache coherence mechanisms

Page 42:

42

Cache Coherence Protocols

Write update – propagate the cache line to other processors on every write by a processor

Write invalidate – a write invalidates copies in other caches; each processor gets the updated cache line when it next reads the (now stale) data

Which is better?

Page 43:

43

Invalidation Based Cache Coherence

[Figure: P1 and P2 both read X and cache X: 0; when P1 writes X = 1, an invalidate message removes the copy in P2's cache, so P2's next read of X misses, fetches the updated line, and sees X: 1.]

Page 44:

44

Cache Coherence using invalidate protocols

3 states associated with data items
  Shared – a variable shared by 2 caches
  Invalid – another processor (say P0) has updated the data item
  Dirty – state of the data item in P0

Implementations
  Snoopy
    For bus-based architectures
    Shared bus interconnect where all cache controllers monitor all bus activity
    There is only one operation through the bus at a time; cache controllers can be built to take corrective action and enforce coherence in caches
    Memory operations are propagated over the bus and snooped
  Directory-based
    Instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors
    A central directory maintains the states of cache blocks and the associated processors
    Implemented with presence bits
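As an illustration of the last point (a sketch with assumed naming, not from the slides), a directory entry for one memory block in a full-bit-vector directory protocol could look like this:

    #include <stdint.h>

    /* Hypothetical directory entry for one memory block (up to 64 processors). */
    typedef struct {
        enum { UNCACHED, SHARED, DIRTY } state;  /* state of the block              */
        uint64_t presence;                       /* bit i set => cache i has a copy */
        int owner;                               /* valid only when state == DIRTY  */
    } dir_entry_t;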

Page 45:

END