
Page 1:

SE-292 High Performance Computing
Intro. to Concurrent Programming & Parallel Architecture

R. Govindarajan

govind@serc

Page 2:

PARALLEL ARCHITECTURE

Parallel machine: a computer system with more than one processor. Special parallel machines are designed to make this interaction (communication) overhead low.

Questions:
- What about multicores? Yes, they are parallel machines.
- What about a network of machines? Yes, but the time involved in interaction (communication) might be high, as the system is designed assuming that the machines are more or less independent.

Page 3:

Classification of Parallel Machines

Flynn's Classification: in terms of the number of instruction streams and data streams
- Instruction stream: path to instruction memory (PC)
- Data stream: path to data memory
- SISD: single instruction stream, single data stream
- SIMD: single instruction stream, multiple data streams
- MIMD: multiple instruction streams, multiple data streams

Page 4:

PARALLEL PROGRAMMING

Recall Flynn's classification: SIMD, MIMD
Programming side: SPMD, MPMD
- Shared memory: threads (see the sketch below)
- Message passing: using messages

Speedup = T_sequential / T_parallel
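To make the shared-memory (threads) programming model above concrete, here is a minimal SPMD-style sketch in C with POSIX threads. The array, thread count, and the sum being computed are illustrative assumptions, not taken from the slides.

#include <pthread.h>
#include <stdio.h>
#define NTHREADS 4
#define N 1000000
static double a[N];
static double partial[NTHREADS];
/* Each thread sums its contiguous chunk of a[]; results are shared via memory. */
static void *worker(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS);
    long hi = (id == NTHREADS - 1) ? N : lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++) s += a[i];
    partial[id] = s;
    return NULL;
}
int main(void) {
    pthread_t tid[NTHREADS];
    double sum = 0.0;
    for (long i = 0; i < N; i++) a[i] = 1.0;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);   /* wait for a worker, then combine its partial sum */
        sum += partial[t];
    }
    printf("sum = %f\n", sum);
    return 0;
}

In the message-passing model, the same computation would instead exchange the partial sums as explicit messages (see the MPI sketch on the last page).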

Page 5:

How Much Speedup is Possible?

Let s be the fraction of the sequential execution time of a given program that cannot be parallelized.

Assume that the program is parallelized so that the remaining part (1 - s) is perfectly divided to run in parallel across n processors. Then

Speedup = 1 / (s + (1 - s)/n)

lim (n → ∞) Speedup = 1/s

Maximum speedup achievable is limited by the sequential fraction of the sequential program: Amdahl's Law.
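A small, self-contained C sketch of this formula; the sequential fraction and processor counts below are arbitrary, illustrative values.

#include <stdio.h>
/* Amdahl's Law: speedup with sequential fraction s on n processors. */
static double amdahl_speedup(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}
int main(void) {
    double s = 0.1;                        /* assume 10% of the time is sequential */
    int procs[] = {1, 4, 16, 64, 1024};
    for (int i = 0; i < 5; i++)
        printf("n = %4d  speedup = %.2f\n", procs[i], amdahl_speedup(s, procs[i]));
    printf("limit as n -> infinity: %.2f\n", 1.0 / s);   /* bounded by 1/s */
    return 0;
}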

Page 6:

Understanding Amdahl’s Law

[Figure: concurrency vs. time profiles for computing min(|A[i,j] - B[i,j]|) over an n x n array with p processors: (a) Serial, two phases of n^2 work each at concurrency 1; (b) Naive Parallel, the first phase done in n^2/p time at concurrency p but the second phase still serial (n^2 at concurrency 1); (c) Parallel, both phases divided across the p processors (about n^2/p each).]
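The figure can be read as the following two-phase computation; the OpenMP sketch below is my interpretation, with made-up array contents, and is not code from the slides. The naive version parallelizes only the element-wise phase (profile (b)), while the second version also parallelizes the min reduction (profile (c)).

#include <float.h>
#include <math.h>
#define N 2048
static double A[N][N], B[N][N], D[N][N];
/* Naive parallel: phase 1 (differences) in parallel, phase 2 (min) left serial. */
double min_diff_naive(void) {
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            D[i][j] = fabs(A[i][j] - B[i][j]);   /* n^2 work at concurrency p */
    double m = DBL_MAX;
    for (int i = 0; i < N; i++)                  /* n^2 work at concurrency 1 */
        for (int j = 0; j < N; j++)
            if (D[i][j] < m) m = D[i][j];
    return m;
}
/* Parallel: the min is also computed as a parallel reduction. */
double min_diff_parallel(void) {
    double m = DBL_MAX;
    #pragma omp parallel for collapse(2) reduction(min:m)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double d = fabs(A[i][j] - B[i][j]);
            if (d < m) m = d;
        }
    return m;
}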

Page 7:

Classification 2: Shared Memory vs Message Passing

Shared memory machine: the n processors share a physical address space; communication can be done through this shared memory.

The alternative is sometimes referred to as a message passing machine or a distributed memory machine.

[Figure: shared memory organization, processors connected through an interconnect to a common main memory; and distributed memory organization, processors each paired with their own memory module and connected through an interconnect.]

Page 8:

Shared Memory Machines

The shared memory could itself be distributed among the processor nodes. Each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time.

Terms:
- Shared vs Private
- Local vs Remote
- Centralized vs Distributed Shared
- UMA vs NUMA architecture (Uniform vs Non-Uniform Memory Access)

Page 9:

Shared Memory Architecture

[Figure: Centralized Shared Memory, Uniform Memory Access (UMA): processors with caches ($) connected through a network to shared memory modules (M). Distributed Shared Memory, Non-Uniform Memory Access (NUMA): each processor with its cache and a local memory module, connected through a network.]

Page 10:

MultiCore Structure

[Figure: a multicore chip with eight cores (C0 to C7); each core has a private L1 cache ($), each pair of cores shares an L2 cache, and all cores connect to a common memory.]

Page 11:

NUMA Architecture

[Figure: a two-socket NUMA system; each socket has four cores (C0, C2, C4, C6 and C1, C3, C5, C7) with private L1 and L2 caches and a shared L3 cache, an integrated memory controller (IMC) attached to local memory, and QPI links connecting the two sockets.]

Page 12:

Distributed Memory Architecture

[Figure: nodes, each with a processor (P), cache ($), and memory (M), connected through a network.]

Message Passing Architecture:
- Memory is private to each node
- Processes communicate by messages
- E.g., a cluster

Page 13:

Parallel Architecture: Interconnections

Indirect interconnects: nodes are connected to an interconnection medium, not directly to each other
- Shared bus, multiple bus, crossbar, MIN (multistage interconnection network)

Direct interconnects: nodes are connected directly to each other
- Topology: linear, ring, star, mesh, torus, hypercube
- Routing techniques: how the route taken by a message from source to destination is decided (see the sketch below)
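As one concrete routing technique, here is a hedged sketch of dimension-order (e-cube) routing in a binary n-cube, where node labels are bit strings and neighbours differ in exactly one bit. The function and its output format are mine, not from the slides.

#include <stdio.h>
/* Dimension-order (e-cube) routing in a binary n-cube: correct the address
 * bits that differ, one dimension at a time, from lowest to highest. */
void ecube_route(unsigned src, unsigned dst, int n) {
    unsigned cur = src;
    printf("%u", cur);
    for (int d = 0; d < n; d++) {
        unsigned bit = 1u << d;
        if ((cur ^ dst) & bit) {   /* this dimension still differs */
            cur ^= bit;            /* hop to the neighbour along dimension d */
            printf(" -> %u", cur);
        }
    }
    printf("\n");
}
int main(void) {
    ecube_route(0, 5, 3);   /* in a 3-cube: 000 -> 001 -> 101 */
    return 0;
}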

Page 14:

Indirect Interconnects

[Figure: examples of indirect interconnects, shared bus, multiple bus, crossbar switch, and a multistage interconnection network built from 2x2 crossbars.]

Page 15:

Direct Interconnect Topologies

[Figure: direct interconnect topologies, linear array, ring, star, 2D mesh, torus, and hypercube (binary n-cube, shown for n = 2 and n = 3).]

Page 16:

Shared Memory Architecture: Caches

[Figure: processors P1 and P2 each read shared location X (value 0), so both caches hold X = 0. P1 then writes X = 1, updating its own cache. When P2 reads X again, it gets a cache hit on its stale copy X = 0: wrong data!]

Page 17:

Cache Coherence Problem

If each processor in a shared memory multiprocessor machine has a data cache, there is a potential data consistency problem: the cache coherence problem
- Cause: modification of a shared variable in a private cache
- Objective: processes shouldn't read `stale' data

Solutions:
- Hardware: cache coherence mechanisms
- Software: compiler-assisted cache coherence

Page 18:

Example: Write Once Protocol

Assumption: a shared bus interconnect where all cache controllers monitor all bus activity (called snooping)

There is only one operation on the bus at a time, so cache controllers can be built to take corrective action and enforce coherence in the caches. Corrective action could involve updating or invalidating a cache block.

Page 19:

Invalidation Based Cache Coherence

[Figure: P1 and P2 both read X (value 0) into their caches. When P1 writes X = 1, an invalidate transaction on the bus removes the copy in P2's cache, and P1's cache (and memory) now hold X = 1. P2's next read of X misses and fetches the up-to-date value X = 1.]

Page 20:

Snoopy Cache Coherence and Locks

[Figure: one processor executes Test&Set(L), finds L = 0, sets L = 1, and enters its critical section while holding the lock. The other processors spin, repeatedly executing Test&Set(L); each attempt writes L = 1 and so keeps invalidating the other cached copies even though the lock is held. When the holder releases the lock (L = 0), one of the spinning processors' Test&Set succeeds and acquires it.]
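A minimal sketch of such a Test&Set lock using C11 atomics (the slides give no code; the names are illustrative):

#include <stdatomic.h>
static atomic_flag L = ATOMIC_FLAG_INIT;   /* clear = free, set = held */
/* Test&Set(L) in a loop: spin until the value read back was 0 (free). */
static void lock(void) {
    while (atomic_flag_test_and_set(&L))
        ;   /* each failed attempt still writes L, generating coherence traffic */
}
static void unlock(void) {
    atomic_flag_clear(&L);   /* Release L: set it back to 0 */
}

A common refinement, test-and-test-and-set, first spins on an ordinary read and only issues the atomic Test&Set when the lock looks free, which reduces the invalidation traffic illustrated in the figure.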

Page 21:

Space of Parallel Computing

Programming Models: what the programmer uses in coding applications; specifies synchronization and communication
- Shared address space
- Message passing
- Data parallel

Parallel Architecture:
- Shared Memory
  - Centralized shared memory (UMA)
  - Distributed Shared Memory (NUMA)
- Distributed Memory
  - A.k.a. message passing; e.g., clusters

Page 22:

Memory Consistency Model

The memory consistency model defines the order in which memory operations will appear to execute
⇒ What value can a read return?

It is a contract between application software and the system
⇒ Affects ease-of-programming and performance

Page 23:

Implicit Memory Model

Sequential consistency (SC) [Lamport]: the result of an execution appears as if
• Operations from different processors were executed in some sequential (interleaved) order
• Memory operations of each process appear in program order

[Figure: processors P1, P2, P3, ..., Pn all accessing a single shared MEMORY, whose accesses form one interleaved order.]
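As a concrete illustration (my example, not from the slides): in the classic two-thread litmus test below, sequential consistency rules out the outcome r1 = r2 = 0, because any interleaving that keeps each thread's program order must place one of the writes before both reads. C11 seq_cst atomics, used here, provide exactly this guarantee.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>
atomic_int X = 0, Y = 0;
int r1, r2;
void *t1(void *arg) {
    (void)arg;
    atomic_store(&X, 1);      /* program order: write X, then read Y */
    r1 = atomic_load(&Y);
    return NULL;
}
void *t2(void *arg) {
    (void)arg;
    atomic_store(&Y, 1);      /* program order: write Y, then read X */
    r2 = atomic_load(&X);
    return NULL;
}
int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Under SC, (r1, r2) can be (0,1), (1,0) or (1,1), but never (0,0). */
    printf("r1 = %d, r2 = %d\n", r1, r2);
    return 0;
}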

Page 24:

Distributed Memory Architecture

[Figure: processing nodes, each containing a processor (P), cache ($), and local memory (M), connected through a network.]

Message Passing Architecture:
- Memory is private to each node
- Processes communicate by messages

Page 25:

NUMA Architecture

[Figure: the two-socket NUMA node from before, each socket with four cores, private L1/L2 caches, shared L3 cache, IMC to local memory, and QPI links; an IO-Hub connects a network interface card (NIC) to the node.]

Page 26:

Using MultiCores to Build Cluster

[Figure: four multicore nodes (Node 0 to Node 3), each with its own memory and NIC, connected by a network switch to form a cluster.]

Page 27:

Using MultiCores to Build Cluster

[Figure: the same four-node cluster; one node sends an array, Send A(1:N), over the network switch to another node.]
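A hedged MPI sketch of the Send A(1:N) step shown in the figure; the ranks, message tag, and array size are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>
#define N 1024
int main(int argc, char **argv) {
    int rank;
    double A[N];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        for (int i = 0; i < N; i++) A[i] = i;               /* fill A(1:N) on node 0 */
        MPI_Send(A, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* send A to node 1 */
    } else if (rank == 1) {
        MPI_Recv(A, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("node 1 received A, last element = %f\n", A[N - 1]);
    }
    MPI_Finalize();
    return 0;
}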