SE-292 High Performance Computing
Intro. to Concurrent Programming & Parallel Architecture
R. Govindarajan
govind@serc
PARALLEL ARCHITECTURE

Parallel Machine: a computer system with more than one processor. Special parallel machines are designed to make the interaction overhead between processors low.

Questions: What about multicores? What about a network of machines?
Yes, both are parallel machines, but in a network of machines the time involved in interaction (communication) can be high, as the system is designed assuming that the machines are more or less independent.
Classification of Parallel Machines

Flynn's Classification: in terms of the number of instruction streams and data streams
- Instruction stream: path to instruction memory (PC)
- Data stream: path to data memory
- SISD: single instruction stream, single data stream
- SIMD: single instruction stream, multiple data streams
- MIMD: multiple instruction streams, multiple data streams
PARALLEL PROGRAMMING

Recall Flynn's classification: SIMD, MIMD
On the programming side: SPMD, MPMD
- Shared memory: programming with threads
- Message passing: programming with messages
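As an illustrative sketch of the SPMD (single program, multiple data) style with shared-memory threads, the following Python fragment runs the same worker function in every thread, with each thread selecting its own slice of the shared data by thread id. The function name and the slicing scheme are invented here for illustration, not taken from the course material.

```python
import threading

def spmd_sum(data, nthreads):
    """SPMD style: every thread runs the same program, and each
    thread uses its thread id to pick its own portion of the
    shared data (shared-memory programming with threads)."""
    partial = [0] * nthreads          # one slot per thread, no sharing conflicts

    def worker(tid):
        chunk = len(data) // nthreads
        lo = tid * chunk
        hi = len(data) if tid == nthreads - 1 else lo + chunk
        partial[tid] = sum(data[lo:hi])   # each thread sums its own slice

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)               # combine the per-thread results
```

The same source runs on every thread; only the data each thread touches differs, which is exactly the SPMD idea.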
Speedup = T_sequential / T_parallel
How Much Speedup is Possible?

Let s be the fraction of the sequential execution time of a given program that cannot be parallelized. Assume that the program is parallelized so that the remaining part (1 - s) is perfectly divided to run in parallel across n processors.

Speedup = 1 / (s + (1 - s)/n)

lim (n -> infinity) Speedup = 1/s

The maximum speedup achievable is limited by the sequential fraction of the sequential program. This is Amdahl's Law.
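The formula above can be checked numerically. The helper below (a hypothetical name, written for this note) evaluates Amdahl's speedup for a sequential fraction s and n processors:

```python
def amdahl_speedup(s, n):
    """Speedup predicted by Amdahl's Law: the sequential fraction s
    runs unchanged, while the parallel fraction (1 - s) is divided
    perfectly across n processors."""
    return 1.0 / (s + (1.0 - s) / n)
```

For example, with s = 0.1 even a very large n cannot push the speedup past 1/s = 10, matching the limit stated above.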
Understanding Amdahl's Law

[Figure: execution profiles for computing min(|A[i,j] - B[i,j]|) over an n x n array on p processors, with time on the horizontal axis and concurrency on the vertical axis: (a) Serial: two phases of n^2 work each at concurrency 1; (b) Naïve Parallel: the first phase runs at concurrency p in time n^2/p, but the second phase remains serial and takes n^2; (c) Parallel: both phases run at concurrency p, each in time n^2/p.]
Classification 2: Shared Memory vs Message Passing

Shared memory machine: the n processors share a physical address space. Communication can be done through this shared memory.

The alternative is sometimes referred to as a message passing machine or a distributed memory machine.

[Figure: (left) shared memory: processors P connected through an interconnect to a common main memory; (right) message passing: processors P, each with its own memory M, connected through an interconnect.]
Shared Memory Machines

The shared memory could itself be distributed among the processor nodes. Each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time.

Terms:
- Shared vs Private
- Local vs Remote
- Centralized vs Distributed Shared
- UMA vs NUMA architecture (Uniform vs Non-Uniform Memory Access)
© RG@SERC, IISc

Shared Memory Architecture

[Figure: (left) Centralized Shared Memory / Uniform Memory Access (UMA): processors P, each with a cache ($), connected through a network to a set of memory modules M; (right) Distributed Shared Memory / Non-Uniform Memory Access (NUMA): nodes, each a processor P with a cache $ and a local memory M, connected through a network.]
MultiCore Structure

[Figure: a multicore chip with eight cores C0-C7, each with a private L1 cache (L1$); each pair of cores shares an L2 cache; all L2 caches connect to memory.]
NUMA Architecture

[Figure: a two-socket NUMA system: each socket holds cores C0-C7 with private L1 and L2 caches and a shared L3 cache, plus an integrated memory controller (IMC) to its local memory; a QPI link connects the two sockets, so access to the other socket's memory is slower than to local memory.]
Distributed Memory Architecture

Message Passing Architecture (Cluster)
- Memory is private to each node
- Processes communicate by messages

[Figure: nodes, each with a processor P, cache $, and private memory M, connected through a network.]
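The message-passing style can be sketched in miniature. In the toy Python model below, two threads stand in for two cluster nodes, each with purely private state, and a queue stands in for the interconnect link; all the names here are invented for illustration, and a real cluster would use a library such as MPI rather than threads.

```python
import threading
import queue

def message_passing_demo():
    """Toy model of message passing: each 'node' keeps its state
    private and communicates only through explicit send/receive
    operations on a channel (a queue models the network link)."""
    chan = queue.Queue()              # models the interconnect
    result = []

    def node0():
        local = [1, 2, 3]             # private memory of node 0
        chan.put(sum(local))          # explicit send of the message

    def node1():
        msg = chan.get()              # explicit receive (blocks until data arrives)
        result.append(msg * 10)       # private computation on node 1

    t0 = threading.Thread(target=node0)
    t1 = threading.Thread(target=node1)
    t0.start(); t1.start()
    t0.join(); t1.join()
    return result[0]
```

Note that node1 never touches node0's list: all interaction goes through the channel, which is the defining property of the distributed-memory model above.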
Parallel Architecture: Interconnections

Indirect interconnects: nodes are connected to an interconnection medium, not directly to each other
- Shared bus, multiple bus, crossbar, MIN (multistage interconnection network)

Direct interconnects: nodes are connected directly to each other
- Topology: linear, ring, star, mesh, torus, hypercube
- Routing techniques: how the route taken by the message from source to destination is decided
Indirect Interconnects

[Figure: shared bus, multiple bus, crossbar switch, and a multistage interconnection network built from 2x2 crossbars.]
Direct Interconnect Topologies

[Figure: linear array, ring, star, 2D mesh, torus, and hypercube (binary n-cube) for n = 2 and n = 3.]
Shared Memory Architecture: Caches

[Figure: P1 and P2 each read X (value 0) into their private caches; P1 then writes X = 1, updating its own cache copy; P2's subsequent read of X hits in its own cache and returns the stale value 0. Cache hit: wrong data!]
Cache Coherence Problem

If each processor in a shared memory multiprocessor machine has a data cache, there is a potential data consistency problem: the cache coherence problem
- It arises when a shared variable is modified while copies of it sit in private caches
- Objective: processes should not read `stale' data

Solutions
- Hardware: cache coherence mechanisms
- Software: compiler-assisted cache coherence
Example: Write Once Protocol

Assumption: a shared bus interconnect where all cache controllers monitor all bus activity
- This monitoring is called snooping

Since there is only one operation on the bus at a time, cache controllers can be built to take corrective action and enforce coherence in the caches
- Corrective action could involve updating or invalidating a cache block
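A toy software model can make the snooping idea concrete. In the illustrative Python sketch below (class names and the write-through simplification are this note's assumptions, not part of any real protocol specification), every cache registers with a bus; on a write, the writer broadcasts an invalidate that all other controllers snoop and obey:

```python
class Bus:
    """Models the shared bus: writes are broadcast so that every
    cache controller can snoop them."""
    def __init__(self):
        self.caches = []

    def broadcast_invalidate(self, addr, writer):
        # every other controller snoops the bus and drops its copy
        for c in self.caches:
            if c is not writer:
                c.lines.pop(addr, None)

class Cache:
    """One processor's private cache, kept coherent by snooping."""
    def __init__(self, bus, memory):
        self.lines = {}               # addr -> cached value
        self.bus = bus
        self.memory = memory
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:    # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_invalidate(addr, self)  # corrective action
        self.lines[addr] = value
        self.memory[addr] = value     # write-through, to keep the model simple
```

Replaying the earlier scenario with this model, a write of X = 1 by one cache removes the stale copy from the other, so the next read misses and fetches the up-to-date value.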
Invalidation Based Cache Coherence

[Figure: P1 and P2 each read X (value 0) into their caches; when P1 writes X = 1, an Invalidate is sent on the bus and P2's copy is invalidated; P2's next read of X misses and fetches the up-to-date value 1.]
Snoopy Cache Coherence and Locks

[Figure: one processor holds lock L and is in its critical section while the others spin, repeatedly executing Test&Set(L); each Test&Set writes L, so the cache line holding L (value 1) migrates between caches with repeated invalidations; when the holder releases L (writes 0), one waiter's Test&Set succeeds and acquires the lock.]
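The Test&Set spin lock in the figure can be sketched in Python. As an admitted stand-in, `Lock.acquire(blocking=False)` plays the role of the atomic Test&Set instruction: it atomically tries to change the flag from free to held and reports whether it succeeded. The class and function names are invented for this sketch.

```python
import threading

class SpinLock:
    """Spin lock built on a test-and-set style primitive.
    acquire(blocking=False) models Test&Set: an atomic attempt
    to change the flag from free (0) to held (1)."""
    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        # spin: each failed attempt is another Test&Set on the flag
        while not self._flag.acquire(blocking=False):
            pass

    def release(self):
        self._flag.release()

def spinlock_demo(nthreads=4, iters=1000):
    """Several threads increment a shared counter under the spin
    lock; without the lock the increments could be lost."""
    counter = 0
    lock = SpinLock()

    def worker():
        nonlocal counter
        for _ in range(iters):
            lock.acquire()
            counter += 1          # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

On real hardware, the repeated Test&Set writes by the spinning threads are exactly what causes the coherence traffic shown in the figure; test-and-test-and-set variants spin on a read to reduce it.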
Space of Parallel Computing

Programming Models: what the programmer uses in coding applications; specifies synchronization and communication
- Shared address space
- Message passing
- Data parallel

Parallel Architecture
- Shared Memory
  - Centralized shared memory (UMA)
  - Distributed Shared Memory (NUMA)
- Distributed Memory
  - A.k.a. message passing
  - E.g., clusters
Memory Consistency Model

Memory consistency model: the order in which memory operations will appear to execute
⇒ What value can a read return?

It is a contract between application software and the system.
⇒ Affects ease-of-programming and performance
Implicit Memory Model

Sequential consistency (SC) [Lamport]: the result of an execution appears as if
- Operations from different processors were executed in some sequential (interleaved) order
- Memory operations of each process appear in program order

[Figure: processors P1, P2, P3, ..., Pn all accessing a single shared MEMORY through one switch.]
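SC's two conditions can be explored mechanically on the classic two-processor litmus test: with x = y = 0 initially, P1 runs "x = 1; r1 = y" and P2 runs "y = 1; r2 = x". The illustrative Python enumerator below (function name and encoding invented here) generates every interleaving that preserves each process's program order and collects the possible (r1, r2) results:

```python
import itertools

def sc_outcomes():
    """Enumerate all sequentially consistent outcomes of the litmus
    test  P1: x=1; r1=y   P2: y=1; r2=x  (x = y = 0 initially).
    Under SC the ops of each process stay in program order, so the
    outcome r1 = r2 = 0 is impossible."""
    p1 = [('w', 'x', 1), ('r', 'y', 'r1')]
    p2 = [('w', 'y', 1), ('r', 'x', 'r2')]
    results = set()
    # choose which 2 of the 4 slots hold P1's ops; the rest hold P2's,
    # each process keeping its own program order
    for positions in itertools.combinations(range(4), 2):
        it1, it2 = iter(p1), iter(p2)
        mem = {'x': 0, 'y': 0}
        regs = {}
        for i in range(4):
            op = next(it1) if i in positions else next(it2)
            if op[0] == 'w':
                mem[op[1]] = op[2]        # write to shared memory
            else:
                regs[op[2]] = mem[op[1]]  # read into a register
        results.add((regs['r1'], regs['r2']))
    return results
```

Weaker consistency models (and real hardware with store buffers) do permit the (0, 0) outcome, which is why the memory model is part of the contract described above.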
NUMA Architecture

[Figure: the same two-socket NUMA node as before (cores C0-C7 with L1/L2 caches, shared L3, IMC to local memory, QPI between sockets), now also showing an IO-Hub with a NIC attached, so the node can be connected into a cluster.]
Using MultiCores to Build Cluster

[Figure: four multicore nodes (Node 0 to Node 3), each with its own memory and NIC, connected through a network switch; a message Send A(1:N) travels from one node to another through the switch.]