High Performance Parallel Programming
Dirk van der Knijff, Advanced Research Computing, Information Division
Lecture 4: Introduction to Parallel Architectures - Distributed Memory Systems
Administrivia
I would like to organise a tutorial session for those students who would like an account on the HPC systems. The tutorial will cover the physical machines we have and how to use the batch system to submit parallel jobs.
Lecture slides: ppt vs pdf?
Lecture 3
We introduced parallel architectures and then looked closely at Shared Memory Systems. In particular we looked at the issues relating to cache coherency and false sharing.
Note: Distributed Memory machines don’t avoid the underlying problems that cause cache coherency trouble - they just offload them to you!
Recap - Architecture
[diagram: processors P connected through an Interconnection Network to memories M]

° Where is the memory physically located?
Distributed Memory
• Each processor is connected to its own memory and cache, and cannot directly access another processor’s memory.
• ccNUMA is a special case...

[diagram: processors P, each with a local memory M, connected by an Interconnection Network]
Topologies
• Early distributed memory machines were built using point-to-point internode connections (limited in number).
• Messages were forwarded by the processors along the path.
• There was a strong emphasis on topology in algorithms, to minimize the number of hops.

e.g. Hypercube: number of hops = log2 n
Network Analogy
• To have a large number of transfers occurring at once, you need a large number of distinct wires.
• Networks are like streets:
  • Link = street.
  • Switch = intersection.
  • Distances (hops) = number of blocks traveled.
  • Routing algorithm = travel plan.
• Properties:
  • Latency: how long to get between nodes in the network.
  • Bandwidth: how much data can be moved per unit time.
  • Bandwidth is limited by the number of wires and the rate at which each wire can accept data.
Characteristics of a Network
– Topology (how things are connected):
  • Crossbar, ring, 2-D mesh and 2-D torus, hypercube, omega network.
– Routing algorithm:
  • Example: all east-west then all north-south (avoids deadlock).
– Switching strategy:
  • Circuit switching: full path reserved for the entire message, like the telephone system.
  • Packet switching: message broken into separately-routed packets, like the post office.
– Flow control (what if there is congestion?):
  • Stall, store data temporarily in buffers, re-route data to other nodes, tell the source node to temporarily halt, discard, etc.
Properties of a Network
– Diameter: the maximum, over all pairs of nodes, of the shortest path between the pair.
– A network is partitioned into two or more disjoint sub-graphs if some nodes cannot reach others.
– The bandwidth of a link = w * 1/t, where:
  • w is the number of wires
  • t is the time per bit
– Effective bandwidth is usually lower due to packet overhead.
– Bisection bandwidth: the sum of the bandwidths of the minimum number of channels which, if removed, would partition the network into two sub-graphs.
Network Topology
– In the early years of parallel computing, there was considerable research in network topology and in mapping algorithms to topology.
– Key cost to be minimized in the early years: number of “hops” (communication steps) between nodes.
– Modern networks hide the hop cost (i.e. “wormhole routing”), so the underlying topology is no longer a major factor in algorithm performance.
– Example: on the IBM SP system, hardware latency varies from 0.5 usec to 1.5 usec, but user-level message passing latency is roughly 36 usec.
– However, since some algorithms have a natural topology, it is worthwhile to have some background in this area.
Linear and Ring Topologies
– Linear array
  • Diameter = n-1; average distance ~ n/3.
  • Bisection bandwidth = 1.
– Torus or Ring
  • Diameter = n/2; average distance ~ n/4.
  • Bisection bandwidth = 2.
  • Natural for algorithms that work with 1D arrays.
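As an illustrative sketch (assuming n nodes with unit-bandwidth links, not taken from the slides), these two sets of numbers can be written straight down:

```python
# Diameter and bisection bandwidth of two simple topologies,
# assuming n nodes and unit-bandwidth links (illustrative sketch).

def linear_array(n):
    # Longest shortest path is end to end; cutting a single middle
    # link splits the array in two.
    return {"diameter": n - 1, "bisection_bandwidth": 1}

def ring(n):
    # Messages can go either way around, so the diameter is n//2;
    # any bisection must cut two links.
    return {"diameter": n // 2, "bisection_bandwidth": 2}

print(linear_array(8))  # {'diameter': 7, 'bisection_bandwidth': 1}
print(ring(8))          # {'diameter': 4, 'bisection_bandwidth': 2}
```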
Meshes and Tori
– 2D mesh (√n x √n nodes)
  • Diameter = 2 * √n
  • Bisection bandwidth = √n
– 2D torus
  • Wrap-around links roughly halve the diameter and double the bisection bandwidth.
° Generalizes to higher dimensions (Cray T3E uses a 3D torus).
° Natural for algorithms that work with 2D and/or 3D arrays.
Hypercubes
– Number of nodes n = 2^d for dimension d.
  • Diameter = d.
  • Bisection bandwidth = n/2.
– 0d, 1d, 2d, 3d, 4d, ...
– Popular in early machines (Intel iPSC, NCUBE).
  • Lots of clever algorithms.
– Gray-code addressing:
  • Each node is connected to the d others whose addresses differ in exactly 1 bit (e.g. 000, 001, 010, 011, 100, 101, 110, 111 for d = 3).
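As a sketch, the addressing scheme above can be checked in a few lines: two hypercube addresses are h hops apart exactly when they differ in h bits, so the diameter is d = log2 n:

```python
# Hypercube addressing sketch: nodes are d-bit integers; neighbours
# differ in exactly one bit, so distance = Hamming distance.

def neighbours(node, d):
    # Flip each of the d address bits in turn.
    return [node ^ (1 << bit) for bit in range(d)]

def hops(a, b):
    # Hamming distance between the two addresses.
    return bin(a ^ b).count("1")

d = 3                        # 3d hypercube: n = 2**3 = 8 nodes
print(neighbours(0b000, d))  # [1, 2, 4]
print(hops(0b000, 0b111))    # 3 hops = diameter of a 3d cube
```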
Trees
– Diameter = log n.
– Bisection bandwidth = 1.
– Easy layout as a planar graph.
– Many tree algorithms (e.g., summation).
– Fat trees avoid the bisection bandwidth problem:
  • More (or wider) links near the top.
  • Example: Thinking Machines CM-5.
Butterflies
– Diameter = log n.
– Bisection bandwidth = n.
– Cost: lots of wires.
– Used in the BBN Butterfly.
– Natural for FFT.

[diagram: butterfly network built from 2x2 switches, ports labelled 0 and 1]
Evolution of Distributed Memory Multiprocessors
– Special queue connections are being replaced by direct memory access (DMA):
  • Processor packs or copies messages.
  • Initiates the transfer, goes on computing.
– Message passing libraries provide a store-and-forward abstraction:
  • Can send/receive between any pair of nodes, not just along one wire.
  • Time proportional to distance, since each processor along the path must participate.
– Wormhole routing in hardware:
  • Special message processors do not interrupt the main processors along the path.
  • Message sends are pipelined.
  • Processors don’t wait for the complete message before forwarding.
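A rough sketch of why pipelined wormhole routing hides the hop count, under assumed (not measured) per-hop and per-byte costs:

```python
# Illustrative cost models (parameter values are assumptions, not
# measurements): store-and-forward pays the full message transfer at
# every hop; wormhole routing pays the hop cost once for the header
# while the message body streams behind it in a pipeline.

def store_and_forward(n, h, t_hop=1.0, t_byte=0.01):
    # Each of the h hops must receive all n bytes before forwarding.
    return h * (t_hop + n * t_byte)

def wormhole(n, h, t_hop=1.0, t_byte=0.01):
    # Header pays h hops; the body overlaps with routing.
    return h * t_hop + n * t_byte

print(store_and_forward(10000, 8))  # ~808.0
print(wormhole(10000, 8))           # ~108.0
```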
Networks of Workstations
• Today’s dominant paradigm.
• Packaged solutions are available:
  – IBM SP2, Compaq Alphaserver SC...
• Fast commodity interconnects are now available:
  – Myrinet, ServerNet...
• Many applications do not need close coupling:
  – even Ethernet will do.
• Cheap!!!
Latency and Bandwidth Model
– Time to send a message of length n is roughly:
  Time = latency + n*cost_per_word = latency + n/bandwidth
– Topology is assumed irrelevant.
– Often called the “α-β model” and written:
  Time = α + n*β
– Usually α >> β >> time per flop.
  • One long message is cheaper than many short ones:
    α + n*β << n*(α + 1*β)
  • Can do hundreds or thousands of flops for the cost of one message.
– Lesson: need a large computation-to-communication ratio to be efficient.
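The α-β model can be sketched directly; the values α = 10 µs and β = 0.01 µs/word below are illustrative assumptions, but they show why one long message beats many short ones:

```python
# alpha-beta message cost model: Time = alpha + n*beta.
ALPHA = 10.0   # latency per message, microseconds (assumed value)
BETA = 0.01    # time per word, microseconds (assumed value)

def message_time(n_words):
    return ALPHA + n_words * BETA

# One message of 1000 words vs 1000 messages of 1 word each.
one_long = message_time(1000)        # ~20 us
many_short = 1000 * message_time(1)  # ~10010 us
print(one_long, many_short)
```

The latency term dominates the short messages, so batching communication pays off by orders of magnitude.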
A More Detailed Performance Model: LogP
– L: latency across the network (may be variable).
– o: overhead (sending and receiving busy time).
– g: gap between messages (1/bandwidth).
– P: number of processors.
– People often group the overheads into latency (the α-β model).
– Real costs are more complicated.

[diagram: two processor/memory pairs; send overhead o_s and receive overhead o_r, separated by L (latency)]
Local Machines
• NEC SX-4 - Parallel Vector Processor
  – 2 cpus, X-bar
• Compaq ES40 - SMP
  – 4 cpus, X-bar
• Compaq DS10, PW600 - Network of Workstations
  – 28 cpus, Fast Ethernet
• VPAC AlphaserverSC - Cluster of SMPs
  – 128 cpus (4 * 32), Quadrics Switch
Programming Models
Machine architectures and programming models have a natural affinity, both in their descriptions and in their effects on each other. Algorithms have been developed to fit machines, and machines have been built to fit algorithms.
Purpose-built supercomputers - ASCI; FPGAs - QCD machines
Parallel Programming Models
– Control• How is parallelism created?• What orderings exist between operations?• How do different threads of control synchronize?
– Data• What data is private vs. shared?• How is logically shared data accessed or
communicated?
– Operations• What are the atomic operations?
– Cost• How do we account for the cost of each of the above?
Trivial Example: computing s = Σ_{i=0}^{n-1} f(A[i])
– Parallel decomposition:
  • Each evaluation and each partial sum is a task.
– Assign n/p numbers to each of p procs:
  • Each computes independent “private” results and a partial sum.
  • One (or all) collects the p partial sums and computes the global sum.
– Two classes of data:
  – Logically shared:
    • The original n numbers, the global sum.
  – Logically private:
    • The individual function evaluations.
    • What about the individual partial sums?
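As a sketch of the decomposition (serial Python standing in for p processors; `f` is a placeholder function, not from the slides):

```python
# Decompose the sum of f(A[i]) over n values into p partial sums.
def f(x):
    return x * x          # placeholder for the real function

A = list(range(8))        # n = 8 values
p = 4                     # pretend we have 4 processors

# Each "processor" k works on a private slice and a private partial sum.
partials = [sum(f(x) for x in A[k * len(A)//p:(k + 1) * len(A)//p])
            for k in range(p)]

s = sum(partials)         # one task collects the global sum
print(partials, s)        # [1, 13, 41, 85] 140
assert s == sum(f(x) for x in A)
```

The slice bounds make the partial sums private; only `partials` and the final `s` are logically shared.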
Model 1: Shared Address Space
• Program consists of a collection of threads of control.
• Each has a set of private variables, e.g. local variables on the stack.
• Collectively they share a set of shared variables, e.g. static variables, shared common blocks, the global heap.
• Threads communicate implicitly by writing and reading shared variables.
• Threads coordinate explicitly by synchronization operations on shared variables -- writing and reading flags, locks or semaphores.
• Like concurrent programming on a uniprocessor.
Shared Address Space
[diagram: threads P ... P, each with private variables, all reading and writing shared variables x and y]
Machine model - Shared memory
Shared Memory Code for Computing a Sum
[s = 0 initially]

Thread 1:
  local_s1 = 0
  for i = 0, n/2-1
    local_s1 = local_s1 + f(A[i])
  s = s + local_s1

Thread 2:
  local_s2 = 0
  for i = n/2, n-1
    local_s2 = local_s2 + f(A[i])
  s = s + local_s2

What could go wrong?
Solution via Synchronization
° Instructions from different threads can be interleaved arbitrarily.
° What can the final result s stored in memory be?
° Problem: race condition.
° Pitfall in computing a global sum s = local_s1 + local_s2:

  Thread 1 (initially s=0)           Thread 2 (initially s=0)
    load s  [from mem to reg]          load s  [from mem to reg; still 0]
    s = s + local_s1 [in reg]          s = s + local_s2 [in reg]
    store s [from reg to mem]          store s [from reg to mem]

  Depending on the interleaving, the stored s can be local_s1, local_s2, or the correct sum.
° Possible solution: mutual exclusion with locks:

  Thread 1                           Thread 2
    lock                               lock
    load s                             load s
    s = s + local_s1                   s = s + local_s2
    store s                            store s
    unlock                             unlock

° Locks must be atomic (execute completely without interruption).
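A minimal runnable sketch of the locked version, using Python threads (illustrative translation of the pseudocode, not from the slides; `f` is a placeholder):

```python
import threading

# Shared sum s, protected by a lock so "load, add, store" is atomic.
s = 0
s_lock = threading.Lock()
A = list(range(100))

def f(x):
    return x + 1          # placeholder function

def worker(chunk):
    global s
    local_s = sum(f(x) for x in chunk)   # private partial sum, no lock needed
    with s_lock:                         # mutual exclusion around the update
        s = s + local_s

t1 = threading.Thread(target=worker, args=(A[:50],))
t2 = threading.Thread(target=worker, args=(A[50:],))
t1.start(); t2.start()
t1.join(); t2.join()

print(s)                  # same answer as the serial sum
assert s == sum(f(x) for x in A)
```

Holding the lock only for the final update, not for the whole partial sum, keeps the serial section short.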
Model 2: Message Passing
• Program consists of a collection of named processes.
• Each is a thread of control plus a local address space -- NO shared data.
• Local variables, static variables, common blocks, heap.
• Processes communicate by explicit data transfers -- a matching send and receive pair identified by source and destination processors (these may be broadcast).
• Coordination is implicit in every communication event.
• Logically shared data is partitioned over the local processes.
• Like distributed programming -- program with MPI, PVM.
Message Passing
[diagram: processes P0 ... Pn, each with private variables X and Y, exchanging data via send P0,X and recv Pn,Y]
Machine model - Distributed Memory
Computing s = x(1)+x(2) on each processor
° First possible solution:

  Processor 1                    Processor 2
    send xlocal, proc2             receive xremote, proc1
      [xlocal = x(1)]              send xlocal, proc1
    receive xremote, proc2           [xlocal = x(2)]
    s = xlocal + xremote           s = xlocal + xremote

° Second possible solution -- what could go wrong?

  Processor 1                    Processor 2
    send xlocal, proc2             send xlocal, proc1
      [xlocal = x(1)]                [xlocal = x(2)]
    receive xremote, proc2         receive xremote, proc1
    s = xlocal + xremote           s = xlocal + xremote

° What if send/receive acts like the telephone system? The post office?
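The telephone/post-office question can be made concrete with a toy simulator (entirely hypothetical helper names, not a real MPI API): with synchronous, telephone-like sends the second solution deadlocks, while buffered, post-office-like sends let it complete:

```python
# Toy send/recv simulator. Buffered sends (post office) complete
# immediately; synchronous sends (telephone) block until the peer is
# sitting at the matching receive.

def completes(programs, buffered):
    pc = [0] * len(programs)          # program counter per process
    mailbox = {}                      # (src, dst) -> buffered message count
    progress = True
    while progress:
        progress = False
        for i, prog in enumerate(programs):
            if pc[i] >= len(prog):
                continue
            op, peer = prog[pc[i]]
            if op == "send" and buffered:
                mailbox[(i, peer)] = mailbox.get((i, peer), 0) + 1
                pc[i] += 1
                progress = True
            elif op == "send":        # synchronous: rendezvous required
                if pc[peer] < len(programs[peer]) and \
                        programs[peer][pc[peer]] == ("recv", i):
                    pc[i] += 1
                    pc[peer] += 1
                    progress = True
            elif mailbox.get((peer, i), 0) > 0:   # recv, message waiting
                mailbox[(peer, i)] -= 1
                pc[i] += 1
                progress = True
    return all(pc[i] == len(p) for i, p in enumerate(programs))

SOLUTION_1 = [[("send", 1), ("recv", 1)], [("recv", 0), ("send", 0)]]
SOLUTION_2 = [[("send", 1), ("recv", 1)], [("send", 0), ("recv", 0)]]

print(completes(SOLUTION_1, buffered=False))  # True: each send meets a recv
print(completes(SOLUTION_2, buffered=False))  # False: both send first - deadlock
print(completes(SOLUTION_2, buffered=True))   # True: the post office buffers the sends
```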
Model 3: Data Parallel
• Single sequential thread of control consisting of parallel operations.
• Parallel operations applied to all (or a defined subset) of a data structure.
• Communication is implicit in parallel operators and “shifted” data structures.
• Elegant and easy to understand and reason about.
• Like marching in a regiment.
• Used by Matlab.
• Drawback: not all problems fit this model.
Data Parallel

[diagram: array A mapped elementwise through f to fA, then reduced by sum to s]

A = array of all data
fA = f(A)
s = sum(fA)

Machine model - SIMD
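The three lines A = array of all data, fA = f(A), s = sum(fA) can be mimicked in a dependency-free Python sketch (the elementwise `f` is a placeholder):

```python
# Data-parallel style: one logical operation per line, applied to the
# whole array at once (here emulated with a comprehension and sum).

def f(x):
    return 2 * x          # placeholder elementwise function

A = [1, 2, 3, 4]          # A  = array of all data
fA = [f(x) for x in A]    # fA = f(A), conceptually elementwise in parallel
s = sum(fA)               # s  = sum(fA), a parallel reduction
print(fA, s)              # [2, 4, 6, 8] 20
```

A real data-parallel system (HPF, Matlab) would execute each line over all elements simultaneously; the program text is the same single thread of control.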
SIMD Architecture
– A large number of (usually) small processors.
– A single “control processor” issues each instruction.
– Each processor executes the same instruction.
– Some processors may be turned off on some instructions.
– Machines are not popular (CM2), but the programming model is.
– Implemented by mapping n-fold parallelism onto p processors.
– Programming:
  • Mostly done in the compilers (HPF = High Performance Fortran).
  • Usually modified to Single Program Multiple Data, or SPMD.
Next week - Programming