High Performance Parallel Programming
Dirk van der Knijff, Advanced Research Computing, Information Division
Lecture 4: Introduction to Parallel Architectures - Distributed Memory Systems
Administrivia
I would like to organise a tutorial session for those students who would like an account on the HPC systems. The tutorial will cover the physical machines we have and how to use the batch system to submit parallel jobs.
Lecture slides: ppt vs pdf?
Lecture 3
We introduced parallel architectures and then looked closely at Shared Memory Systems. In particular we looked at the issues relating to cache coherency and false sharing.
Note: Distributed Memory machines don’t avoid the underlying problems that cause cache coherency trouble - they just offload them to you!
Recap - Architecture
[diagram: processors P connected through an Interconnection Network to memories M]

° Where is the memory physically located?
Distributed Memory
• Each processor is connected to its own memory and cache, and cannot directly access another processor’s memory.
• ccNUMA is a special case...

[diagram: processors P, each with a local memory M, connected by an Interconnection Network]
Topologies
• Early distributed memory machines were built using point-to-point internode connections (limited in number).
• Messages were forwarded by the processors along the path.
• There was a strong emphasis on topology in algorithms, to minimize the number of hops.

e.g. Hypercube: number of hops = log2 n
Network Analogy
• To have a large number of transfers occurring at once, you need a large number of distinct wires.
• Networks are like streets:
  • Link = street.
  • Switch = intersection.
  • Distances (hops) = number of blocks traveled.
  • Routing algorithm = travel plan.
• Properties:
  • Latency: how long to get between nodes in the network.
  • Bandwidth: how much data can be moved per unit time.
  • Bandwidth is limited by the number of wires and the rate at which each wire can accept data.
Characteristics of a Network
– Topology (how things are connected):
  • Crossbar, ring, 2-D mesh and 2-D torus, hypercube, omega network.
– Routing algorithm:
  • Example: all east-west then all north-south (avoids deadlock).
– Switching strategy:
  • Circuit switching: full path reserved for the entire message, like the telephone system.
  • Packet switching: message broken into separately-routed packets, like the post office.
– Flow control (what if there is congestion?):
  • Stall, store data temporarily in buffers, re-route data to other nodes, tell the source node to temporarily halt, discard, etc.
Properties of a Network
– Diameter: the maximum, over all pairs of nodes, of the shortest path between the pair.
– A network is partitioned into two or more disjoint sub-graphs if some nodes cannot reach others.
– The bandwidth of a link = w * 1/t, where:
  • w is the number of wires
  • t is the time per bit
– Effective bandwidth is usually lower due to packet overhead.
– Bisection bandwidth: the sum of the bandwidths of the minimum number of channels which, if removed, would partition the network into two sub-graphs.
Network Topology
– In the early years of parallel computing, there was considerable research in network topology and in mapping algorithms to topology.
– Key cost to be minimized in the early years: number of “hops” (communication steps) between nodes.
– Modern networks hide the hop cost (i.e. “wormhole routing”), so the underlying topology is no longer a major factor in algorithm performance.
– Example: on the IBM SP system, hardware latency varies from 0.5 usec to 1.5 usec, but user-level message passing latency is roughly 36 usec.
– However, since some algorithms have a natural topology, it is worthwhile to have some background in this area.
Linear and Ring Topologies
– Linear array
  • Diameter = n-1; average distance ~ n/3.
  • Bisection bandwidth = 1.
– Torus or Ring
  • Diameter = n/2; average distance ~ n/4.
  • Bisection bandwidth = 2.
  • Natural for algorithms that work with 1D arrays.
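As an illustrative sketch (assuming n nodes with unit-bandwidth links, not taken from the slides), these two sets of numbers can be written straight down:

```python
# Diameter and bisection bandwidth of two simple topologies,
# assuming n nodes and unit-bandwidth links (illustrative sketch).

def linear_array(n):
    # Longest shortest path is end to end; cutting a single middle
    # link splits the array in two.
    return {"diameter": n - 1, "bisection_bandwidth": 1}

def ring(n):
    # Messages can go either way around, so the diameter is n//2;
    # any bisection must cut two links.
    return {"diameter": n // 2, "bisection_bandwidth": 2}

print(linear_array(8))  # {'diameter': 7, 'bisection_bandwidth': 1}
print(ring(8))          # {'diameter': 4, 'bisection_bandwidth': 2}
```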
Meshes and Tori
– 2D mesh (√n x √n nodes)
  • Diameter = 2 * √n
  • Bisection bandwidth = √n
– 2D torus
  • Wrap-around links roughly halve the diameter and double the bisection bandwidth.
° Generalizes to higher dimensions (Cray T3E uses a 3D torus).
° Natural for algorithms that work with 2D and/or 3D arrays.
Hypercubes
– Number of nodes n = 2^d for dimension d.
  • Diameter = d.
  • Bisection bandwidth = n/2.
– 0d, 1d, 2d, 3d, 4d, ...
– Popular in early machines (Intel iPSC, NCUBE).
  • Lots of clever algorithms.
– Gray-code addressing:
  • Each node is connected to the d others whose addresses differ in exactly 1 bit (e.g. 000, 001, 010, 011, 100, 101, 110, 111 for d = 3).
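As a sketch, the addressing scheme above can be checked in a few lines: two hypercube addresses are h hops apart exactly when they differ in h bits, so the diameter is d = log2 n:

```python
# Hypercube addressing sketch: nodes are d-bit integers; neighbours
# differ in exactly one bit, so distance = Hamming distance.

def neighbours(node, d):
    # Flip each of the d address bits in turn.
    return [node ^ (1 << bit) for bit in range(d)]

def hops(a, b):
    # Hamming distance between the two addresses.
    return bin(a ^ b).count("1")

d = 3                        # 3d hypercube: n = 2**3 = 8 nodes
print(neighbours(0b000, d))  # [1, 2, 4]
print(hops(0b000, 0b111))    # 3 hops = diameter of a 3d cube
```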
Trees
– Diameter = log n.
– Bisection bandwidth = 1.
– Easy layout as a planar graph.
– Many tree algorithms (e.g., summation).
– Fat trees avoid the bisection bandwidth problem:
  • More (or wider) links near the top.
  • Example: Thinking Machines CM-5.
Butterflies
– Diameter = log n.
– Bisection bandwidth = n.
– Cost: lots of wires.
– Used in the BBN Butterfly.
– Natural for FFT.

[diagram: butterfly network built from 2x2 switches, ports labelled 0 and 1]
Evolution of Distributed Memory Multiprocessors
– Special queue connections are being replaced by direct memory access (DMA):
  • Processor packs or copies messages.
  • Initiates the transfer, goes on computing.
– Message passing libraries provide a store-and-forward abstraction:
  • Can send/receive between any pair of nodes, not just along one wire.
  • Time proportional to distance, since each processor along the path must participate.
– Wormhole routing in hardware:
  • Special message processors do not interrupt the main processors along the path.
  • Message sends are pipelined.
  • Processors don’t wait for the complete message before forwarding.
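A rough sketch of why pipelined wormhole routing hides the hop count, under assumed (not measured) per-hop and per-byte costs:

```python
# Illustrative cost models (parameter values are assumptions, not
# measurements): store-and-forward pays the full message transfer at
# every hop; wormhole routing pays the hop cost once for the header
# while the message body streams behind it in a pipeline.

def store_and_forward(n, h, t_hop=1.0, t_byte=0.01):
    # Each of the h hops must receive all n bytes before forwarding.
    return h * (t_hop + n * t_byte)

def wormhole(n, h, t_hop=1.0, t_byte=0.01):
    # Header pays h hops; the body overlaps with routing.
    return h * t_hop + n * t_byte

print(store_and_forward(10000, 8))  # ~808.0
print(wormhole(10000, 8))           # ~108.0
```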
Networks of Workstations
• Today’s dominant paradigm.
• Packaged solutions are available:
  – IBM SP2, Compaq Alphaserver SC...
• Fast commodity interconnects are now available:
  – Myrinet, ServerNet...
• Many applications do not need close coupling:
  – even Ethernet will do.
• Cheap!!!
Latency and Bandwidth Model
– Time to send a message of length n is roughly:
  Time = latency + n*cost_per_word = latency + n/bandwidth
– Topology is assumed irrelevant.
– Often called the “α-β model” and written:
  Time = α + n*β
– Usually α >> β >> time per flop.
  • One long message is cheaper than many short ones:
    α + n*β << n*(α + 1*β)
  • Can do hundreds or thousands of flops for the cost of one message.
– Lesson: need a large computation-to-communication ratio to be efficient.
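The α-β model can be sketched directly; the values α = 10 µs and β = 0.01 µs/word below are illustrative assumptions, but they show why one long message beats many short ones:

```python
# alpha-beta message cost model: Time = alpha + n*beta.
ALPHA = 10.0   # latency per message, microseconds (assumed value)
BETA = 0.01    # time per word, microseconds (assumed value)

def message_time(n_words):
    return ALPHA + n_words * BETA

# One message of 1000 words vs 1000 messages of 1 word each.
one_long = message_time(1000)        # ~20 us
many_short = 1000 * message_time(1)  # ~10010 us
print(one_long, many_short)
```

The latency term dominates the short messages, so batching communication pays off by orders of magnitude.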
A More Detailed Performance Model: LogP
– L: latency across the network (may be variable).
– o: overhead (sending and receiving busy time).
– g: gap between messages (1/bandwidth).
– P: number of processors.
– People often group the overheads into latency (the α-β model).
– Real costs are more complicated.

[diagram: two processor/memory pairs; send overhead o_s and receive overhead o_r, separated by L (latency)]
Local Machines
• NEC SX-4 - Parallel Vector Processor
  – 2 cpus, X-bar
• Compaq ES40 - SMP
  – 4 cpus, X-bar
• Compaq DS10, PW600 - Network of Workstations
  – 28 cpus, Fast Ethernet
• VPAC AlphaserverSC - Cluster of SMPs
  – 128 cpus (4 * 32), Quadrics Switch
Programming Models
Machine architectures and programming models have a natural affinity, both in their descriptions and in their effects on each other. Algorithms have been developed to fit machines, and machines have been built to fit algorithms.
Purpose-built supercomputers - ASCI; FPGAs - QCD machines
Parallel Programming Models
– Control• How is parallelism created?• What orderings exist between operations?• How do different threads of control synchronize?
– Data• What data is private vs. shared?• How is logically shared data accessed or
communicated?
– Operations• What are the atomic operations?
– Cost• How do we account for the cost of each of the above?
Trivial Example: computing s = Σ_{i=0}^{n-1} f(A[i])
– Parallel decomposition:
  • Each evaluation and each partial sum is a task.
– Assign n/p numbers to each of p procs:
  • Each computes independent “private” results and a partial sum.
  • One (or all) collects the p partial sums and computes the global sum.
– Two classes of data:
  – Logically shared:
    • The original n numbers, the global sum.
  – Logically private:
    • The individual function evaluations.
    • What about the individual partial sums?
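As a sketch of the decomposition (serial Python standing in for p processors; `f` is a placeholder function, not from the slides):

```python
# Decompose the sum of f(A[i]) over n values into p partial sums.
def f(x):
    return x * x          # placeholder for the real function

A = list(range(8))        # n = 8 values
p = 4                     # pretend we have 4 processors

# Each "processor" k works on a private slice and a private partial sum.
partials = [sum(f(x) for x in A[k * len(A)//p:(k + 1) * len(A)//p])
            for k in range(p)]

s = sum(partials)         # one task collects the global sum
print(partials, s)        # [1, 13, 41, 85] 140
assert s == sum(f(x) for x in A)
```

The slice bounds make the partial sums private; only `partials` and the final `s` are logically shared.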
Model 1: Shared Address Space
• Program consists of a collection of threads of control.
• Each has a set of private variables, e.g. local variables on the stack.
• Collectively they share a set of shared variables, e.g. static variables, shared common blocks, the global heap.
• Threads communicate implicitly by writing and reading shared variables.
• Threads coordinate explicitly by synchronization operations on shared variables -- writing and reading flags, locks or semaphores.
• Like concurrent programming on a uniprocessor.
Shared Address Space
[diagram: threads P ... P, each with private variables, all reading and writing shared variables x and y]
Machine model - Shared memory
Shared Memory Code for Computing a Sum
[s = 0 initially]

Thread 1:
  local_s1 = 0
  for i = 0, n/2-1
    local_s1 = local_s1 + f(A[i])
  s = s + local_s1

Thread 2:
  local_s2 = 0
  for i = n/2, n-1
    local_s2 = local_s2 + f(A[i])
  s = s + local_s2

What could go wrong?
Solution via Synchronization
° Instructions from different threads can be interleaved arbitrarily.
° What can the final result s stored in memory be?
° Problem: race condition.
° Pitfall in computing a global sum s = local_s1 + local_s2:

  Thread 1 (initially s=0)           Thread 2 (initially s=0)
    load s  [from mem to reg]          load s  [from mem to reg; still 0]
    s = s + local_s1 [in reg]          s = s + local_s2 [in reg]
    store s [from reg to mem]          store s [from reg to mem]

  Depending on the interleaving, the stored s can be local_s1, local_s2, or the correct sum.
° Possible solution: mutual exclusion with locks:

  Thread 1                           Thread 2
    lock                               lock
    load s                             load s
    s = s + local_s1                   s = s + local_s2
    store s                            store s
    unlock                             unlock

° Locks must be atomic (execute completely without interruption).
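A minimal runnable sketch of the locked version, using Python threads (illustrative translation of the pseudocode, not from the slides; `f` is a placeholder):

```python
import threading

# Shared sum s, protected by a lock so "load, add, store" is atomic.
s = 0
s_lock = threading.Lock()
A = list(range(100))

def f(x):
    return x + 1          # placeholder function

def worker(chunk):
    global s
    local_s = sum(f(x) for x in chunk)   # private partial sum, no lock needed
    with s_lock:                         # mutual exclusion around the update
        s = s + local_s

t1 = threading.Thread(target=worker, args=(A[:50],))
t2 = threading.Thread(target=worker, args=(A[50:],))
t1.start(); t2.start()
t1.join(); t2.join()

print(s)                  # same answer as the serial sum
assert s == sum(f(x) for x in A)
```

Holding the lock only for the final update, not for the whole partial sum, keeps the serial section short.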
Model 2: Message Passing
• Program consists of a collection of named processes.
• Each is a thread of control plus a local address space -- NO shared data.
• Local variables, static variables, common blocks, heap.
• Processes communicate by explicit data transfers -- a matching send and receive pair identified by source and destination processors (these may be broadcast).
• Coordination is implicit in every communication event.
• Logically shared data is partitioned over the local processes.
• Like distributed programming -- program with MPI, PVM.
Message Passing
[diagram: processes P0 ... Pn, each with private variables X and Y, exchanging data via send P0,X and recv Pn,Y]
Machine model - Distributed Memory
Computing s = x(1)+x(2) on each processor
° First possible solution:

  Processor 1                    Processor 2
    send xlocal, proc2             receive xremote, proc1
      [xlocal = x(1)]              send xlocal, proc1
    receive xremote, proc2           [xlocal = x(2)]
    s = xlocal + xremote           s = xlocal + xremote

° Second possible solution -- what could go wrong?

  Processor 1                    Processor 2
    send xlocal, proc2             send xlocal, proc1
      [xlocal = x(1)]                [xlocal = x(2)]
    receive xremote, proc2         receive xremote, proc1
    s = xlocal + xremote           s = xlocal + xremote

° What if send/receive acts like the telephone system? The post office?
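The telephone/post-office question can be made concrete with a toy simulator (entirely hypothetical helper names, not a real MPI API): with synchronous, telephone-like sends the second solution deadlocks, while buffered, post-office-like sends let it complete:

```python
# Toy send/recv simulator. Buffered sends (post office) complete
# immediately; synchronous sends (telephone) block until the peer is
# sitting at the matching receive.

def completes(programs, buffered):
    pc = [0] * len(programs)          # program counter per process
    mailbox = {}                      # (src, dst) -> buffered message count
    progress = True
    while progress:
        progress = False
        for i, prog in enumerate(programs):
            if pc[i] >= len(prog):
                continue
            op, peer = prog[pc[i]]
            if op == "send" and buffered:
                mailbox[(i, peer)] = mailbox.get((i, peer), 0) + 1
                pc[i] += 1
                progress = True
            elif op == "send":        # synchronous: rendezvous required
                if pc[peer] < len(programs[peer]) and \
                        programs[peer][pc[peer]] == ("recv", i):
                    pc[i] += 1
                    pc[peer] += 1
                    progress = True
            elif mailbox.get((peer, i), 0) > 0:   # recv, message waiting
                mailbox[(peer, i)] -= 1
                pc[i] += 1
                progress = True
    return all(pc[i] == len(p) for i, p in enumerate(programs))

SOLUTION_1 = [[("send", 1), ("recv", 1)], [("recv", 0), ("send", 0)]]
SOLUTION_2 = [[("send", 1), ("recv", 1)], [("send", 0), ("recv", 0)]]

print(completes(SOLUTION_1, buffered=False))  # True: each send meets a recv
print(completes(SOLUTION_2, buffered=False))  # False: both send first - deadlock
print(completes(SOLUTION_2, buffered=True))   # True: the post office buffers the sends
```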
Model 3: Data Parallel
• Single sequential thread of control consisting of parallel operations.
• Parallel operations applied to all (or a defined subset) of a data structure.
• Communication is implicit in parallel operators and “shifted” data structures.
• Elegant and easy to understand and reason about.
• Like marching in a regiment.
• Used by Matlab.
• Drawback: not all problems fit this model.
Data Parallel

[diagram: array A mapped elementwise through f to fA, then reduced by sum to s]

A = array of all data
fA = f(A)
s = sum(fA)

Machine model - SIMD
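The three lines A = array of all data, fA = f(A), s = sum(fA) can be mimicked in a dependency-free Python sketch (the elementwise `f` is a placeholder):

```python
# Data-parallel style: one logical operation per line, applied to the
# whole array at once (here emulated with a comprehension and sum).

def f(x):
    return 2 * x          # placeholder elementwise function

A = [1, 2, 3, 4]          # A  = array of all data
fA = [f(x) for x in A]    # fA = f(A), conceptually elementwise in parallel
s = sum(fA)               # s  = sum(fA), a parallel reduction
print(fA, s)              # [2, 4, 6, 8] 20
```

A real data-parallel system (HPF, Matlab) would execute each line over all elements simultaneously; the program text is the same single thread of control.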
SIMD Architecture
– A large number of (usually) small processors.
– A single “control processor” issues each instruction.
– Each processor executes the same instruction.
– Some processors may be turned off on some instructions.
– Machines are not popular (CM2), but the programming model is.
– Implemented by mapping n-fold parallelism onto p processors.
– Programming:
  • Mostly done in the compilers (HPF = High Performance Fortran).
  • Usually modified to Single Program Multiple Data, or SPMD.
Next week - Programming