NUMERICAL PARALLEL COMPUTING
Lecture 1, March 23, 2007: Introduction

Peter Arbenz
Institute of Computational Science, ETH Zürich
E-mail: [email protected]
http://people.inf.ethz.ch/arbenz/PARCO/




Organization: People/Exercises

1. Lecturer: Peter Arbenz, CAB G69.3 (Universitätsstrasse 6), Tel. 632 7432, [email protected]
   Lecture: Friday 8-10, CAB

2. Assistant: Marcus Wittberger, CAB G65.1, Tel. 632, [email protected]
   Exercise class 1: Friday 14-16, IFW D31

3. Assistant: Cyril Flaig, CAB F63.2, Tel. 632, [email protected]
   Exercise class 2: Friday 10-12, IFW D31


Introduction


What is parallel computing [in this course]

A parallel computer is a collection of processors that can solve big problems quickly by means of well-coordinated collaboration.

Parallel computing is the use of multiple processors to execute different parts of the same program concurrently or simultaneously.


An example of parallel computing

Assume you are to sort a deck of playing cards (by suits, then by rank). If you go about it properly, you can complete this task faster if you have people that help you (parallel processing). Note that the work done in parallel is not smaller than the work done sequentially; however, the solution time (wall clock time) is reduced. Notice further that the helpers have to somehow communicate their partial results. This causes some overhead. Clearly, there may be too many helpers (e.g. if there are more than 52). One may observe a relation of speedup vs. number of helpers as depicted in Fig. 1 on the next slide.


Figure: Sorting a deck of cards


Why parallel computing

- Runtime: we want to reduce wall clock time [e.g. in time-critical applications like weather forecasting].

- Memory space: some large applications (grand challenges) need a large number of degrees of freedom to provide meaningful results [reasonably short time steps, sufficiently fine discretization; cf. again weather forecasting]. A large number of small processors probably has a much bigger (fast) memory than a single large machine (PC cluster vs. HP Superdome).


The challenges of parallel computing

The idea is simple: connect a sufficient amount of hardware and you can solve arbitrarily large problems. (But what interconnection network for processors and memory?) BUT, there are a few problems here...

Let's look at the processors first. By Moore's law the number of transistors per square inch doubles every 18-24 months, cf. Fig. 2.

Remark: If your problem is not too big you may want to wait untilthere is a machine that is sufficiently fast to do the job.


Figure: Moore’s law


How did this come about?

- Clock rate (+30% / year)
  - increases power consumption

- Number of transistors (+60-80% / year)
  - parallelization at bit level
  - instruction-level parallelism (pipelining)
  - parallel functional units
  - dual-core / multi-core processors


Instruction level parallelism

Figure: Pipelining of an instruction with 4 subtasks: fetch (F), decode (D), execute (E), write back (W)


Some superscalar processors

                        issuable instructions        clock rate
  processor            max  ALU  FPU  LS  B    (MHz)    year
  Intel Pentium          2    2    1   2  1       66    1993
  Intel Pentium II       3    2    1   3  1      450    1998
  Intel Pentium III      3    2    1   2  1     1100    1999
  Intel Pentium 4        3    3    2   2  1     1400    2001
  AMD Athlon             3    3    3   2  1     1330    2001
  Intel Itanium 2        6    6    2   4  3     1500    2004
  AMD Opteron            3    3    3   2  1     1800    2003

ALU: integer instructions; FPU: floating-point instructions; LS: load-store instructions; B: branch instructions; clock rate at time of introduction. [Rauber & Rünger: Parallele Programmierung]


One problem of high performance computing (and of parallel computing in particular) is caused by the fact that memory access times have not improved accordingly, see Fig. 4. Memory performance doubles only every 6 years [Hennessy & Patterson].

Figure: Memory vs. CPU performance


To alleviate this problem, memory hierarchies with varying access times have been introduced (several levels of caches). But the further away data are from the processor, the longer they take to load and store.

Data access is everything in determining performance!

Sources of performance losses that are specific to parallelcomputing are

- communication overhead: synchronization, sending messages, etc. (Data is not only in a processor's own slow memory, but possibly even in a remote processor's memory.)

- unbalanced loads: the different processors do not have the same amount of work to do.


Outline of the lecture

- Overview of parallel programming, terminology

- SIMD programming on the Pentium (parallel computers are not just in the RZ, most likely there is one in your backpack!)

- Shared memory programming, OpenMP

- Distributed memory programming, Message Passing Interface (MPI)

- Solving dense systems of equations with ScaLAPACK

- Solving sparse systems iteratively with Trilinos

- Preconditioning, reordering (graph partitioning with METIS), parallel file systems

- Fast Fourier Transform (FFT)

- Applications: particle methods, (bone) structure analysis

For details see the PARCO home page.


Exercises

Exercises’ Objectives

1. To study 3 modes of parallelism
   - instruction level (chip level, board level, ...): SIMD
   - shared memory programming on an ETH compute server: MIMD
   - distributed memory programming on a Linux cluster: MIMD

2. Several computational areas will be studied
   - linear algebra (BLAS, iterative methods)
   - FFT and related topics (N-body simulation)

3. Models and programming (remember portability!)
   - examples will be in C/C++ (calling Fortran routines)
   - OpenMP (HP Superdome Stardust/Pegasus)
   - MPI (Opteron/Linux cluster Gonzales)

4. We expect you to solve 6 out of 8 exercises.


References

1. P. S. Pacheco: Parallel Programming with MPI. Morgan Kaufmann, San Francisco CA, 1997. http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-339-5

2. R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, J. McDonald: Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco CA, 2001. http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-671-8

3. W. P. Petersen and P. Arbenz: Introduction to Parallel Computing. Oxford Univ. Press, 2004. http://www.oup.co.uk/isbn/0-19-851577-4

Complementary literature is found on the PARCO home page.


Flynn’s Taxonomy of Parallel Systems

In Flynn's taxonomy, parallel systems are classified according to the number of instruction streams and data streams.

M. Flynn: Proc. IEEE 54 (1966), pp. 1901–1909.


SISD: Single Instruction stream - Single Data stream

The classical von Neumann machine.

- processor: ALU, registers

- memory holds data and program

- bus (a collection of wires) = von Neumann bottleneck

Today's PCs or workstations are no longer true von Neumann machines (superscalar processors, pipelining, memory hierarchies).


SIMD: Single Instruction stream - Multiple Data stream

During each instruction cycle the central control unit broadcasts an instruction to the subordinate processors, and each of them either executes the instruction or is idle. At any given time a processor is either "active", executing exactly the same instruction as all other processors in a completely synchronous way, or idle.


Example SIMD machine: Vector computers

Vector computers were a kind of SIMD parallel computer. Vector operations on machines like the Cray-1, -2, X-MP, Y-MP, ... worked essentially in 3 steps:

1. copy data (like vectors of 64 floating point numbers) into the vector register(s)

2. apply the same operation to all elements in the vector register(s)

3. copy the result from the vector register(s) back to main memory

These machines did not have a cache but a very fast memory. (Some people say that they only had a cache. But there were no cache lines anyway.) The above three steps could overlap: "the pipelines could be chained".


A variant of SIMD: a pipeline

Complicated operations often take more than one cycle to complete. If such an operation can be split into several stages that each take one cycle, then a pipeline can (after a startup phase) produce a result in every clock cycle.

Example: elementwise multiplication of 2 integer arrays of length n,

    c = a .* b  ⟺  c_i = a_i * b_i,  0 ≤ i < n.

Let the numbers a_i, b_i, c_i be split into four fragments (bytes):

    a_i = [a_i3, a_i2, a_i1, a_i0]
    b_i = [b_i3, b_i2, b_i1, b_i0]
    c_i = [c_i3, c_i2, c_i1, c_i0]

Then

    c_ij = a_ij * b_ij + carry from a_i,j-1 * b_i,j-1

This gives rise to a pipeline with four stages.


Example SIMD machine: Pentium 4

- The Pentium III and Pentium 4 support SIMD programming by means of their Streaming SIMD Extensions (SSE, SSE2).

- The Pentium III has vector registers, called Multimedia or MMX registers. There are 8 (eight) of them! They are 64 bit wide and were intended for computations with integer arrays.

- The Pentium 4 additionally has 8 XMM registers that are 128 bit wide.

- The registers are configurable. They support integer and floating point operations. (The XMM registers also support double (64 bit) operations.)

- The registers can be considered vector registers. Although they are very short, they mean the return of vector computing to the desktop.

- We will investigate how these registers can be used and what we can expect in terms of performance.


Example SIMD machine: Cell processor

- IBM, in collaboration with Sony and Toshiba

- Playstation 3

- multicore processor: one Power Processing Element (PPE)

- 8 SIMD co-processors (SPEs)

- 4 FPUs (32 bit), 4 GHz clock, 32 GFlops per SPE

- 1/4 TFlop per Cell, 80 Watt

- SPEs are programmed with compiler intrinsics

Next page: EIB = element interface bus; LS = local store; MIC = memory interface controller; BIC = bus interface controller


Figure: Cell processor architecture: the EIB (up to 96 B/cycle) connects the 64-bit Power architecture core (PPU with L1 and L2), the eight Synergistic Processing Elements (SPUs) with their local stores (LS), the memory interface controller (MIC, dual XDR, 16 B/cycle) and the bus interface controller (BIC, RRAC I/O). [Rauber & Rünger: Parallele Programmierung]


MIMD: Multiple Instruction stream - Multiple Data stream

Each processor can execute its own instruction stream on its own data, independently of the other processors. Each processor is a full-fledged CPU with both a control unit and an ALU. MIMD systems are asynchronous.


Memory organization

Most parallel machines are MIMD machines. MIMD machines are classified by their memory organization:

- shared memory machines (multiprocessors)
  - parallel processes, threads
  - communication by means of shared variables
  - data dependencies possible, race conditions
  - multi-core processors


Interconnection network

- network usually dynamic: crossbar switch

Crossbar switch with n processors and m memory modules. On the right the possible switch states.

- uniform access, scalable, but very many wires ⇐ very expensive; used only for a limited number of processors.


- distributed memory machines (multicomputers)
  - all data are local to some processor
  - programmer responsible for data placement
  - communication by message passing
  - easy / cheap to build → (Beowulf) clusters


Interconnection network

- network usually static: arrays, rings, meshes, tori, hypercubes

- processing elements usually connected to the network through routers; routers can pipeline messages.


Examples of MIMD machines


HP Superdome (Stardust / Pegasus)

The HP Superdome systems are large multi-purpose parallel computers. They serve as the application servers at ETH. For information see http://www.id.ethz.ch/services/list/comp_zentral/

Figure: HP Superdome Stardust (3 cabinets, left) and Pegasus (right)


Superdome specifications

- Stardust: 64 Itanium-2 (1.6 GHz) dual-core processors, 256 GB main memory

- Pegasus: 32 Itanium-2 (1.5 GHz) dual-core processors, 128 GB main memory

- HP-UX (Unix)

- shared memory programming model

- 4-processor cells are connected through a crossbar

- ccNUMA: cache-coherent, non-uniform memory access

- organization: batch processing; jobs are submitted to LSF (Load Sharing Facility). Interactive access is possible on Pegasus.

- system manager: [email protected]

We will use Pegasus for experiments with shared memory programming with C and compiler directives (OpenMP).


Gonzales Cluster

(Speedy) Gonzales is a high-performance Linux cluster based on 288 dual-processor nodes with 64-bit AMD Opteron 250 processors and a Quadrics QsNet II interconnect.

Figure: An image of the old Linux cluster Asgard


Cluster specifications

- One master node, two login nodes (Gonzales) and three file servers, in addition to 288 compute nodes.

- 1 node = two 64-bit AMD Opteron 2.4 GHz processors and 8 GB of main memory (shared by the two processors). Global view: distributed memory.

- All nodes are connected through a Gb-Ethernet switch (NFS and other services).

- Compute nodes are inter-connected via a two-layer Quadrics QsNet II network: sustained bandwidth 900 MB/s, latency 1 µs between any two nodes in the cluster.


Figure: Gonzales fat tree topology

Each 64-way switch is based on a 3-stage fat-tree topology. The top-level switch adds a 4th stage to this fat tree. http://clusterwiki.ethz.ch/wiki/index.php/Gonzales


- Nodes run SuSE Linux (64-bit) with some modifications for the Quadrics interconnect.

- The login nodes have a more or less complete Linux system, including compilers, debuggers, etc., while the compute nodes have a minimal system with only the commands and libraries necessary to run applications.

- The AMD Opteron runs both 32-bit and 64-bit applications.

- Compilers: C/C++, Fortran 77/90 & HPF.

- Note: all parallel applications must be recompiled (in 64-bit) and linked with the optimized MPI library from Quadrics.

- Jobs are submitted from the login nodes to the compute nodes via the LSF batch system, exclusively. Users are not allowed to log in or execute remote commands on the compute nodes.

- System manager: [email protected]


Cray XT-3 at CSCS in Manno

- The Cray XT3 is based on 1664 2.6 GHz AMD Opteron single-core processors that are connected by the Cray SeaStar high-speed network.

- The computer's peak performance is 8.7 Tflop/s.

- Names: the XT3 is called "Red Storm"; the actual machine in Manno is called Horizon.

- Number 94 on the list of the top 500 fastest machines (7.2 Tflop/s), http://www.top500.org/list/2006/11/100, behind two Blue Genes (EPFL, IBM Research) and an Intel cluster (BMW Sauber).


The Cray XT3 supercomputer is called "Red Storm" in the US. The CSCS model has been baptized "Horizon".


SeaStar router: the high-speed interconnection network exchanges data among the six neighbouring nodes in a 3D torus topology.


Bone structure analysis

Computation of stresses in a loaded human bone. FE application with 1.2 · 10^9 degrees of freedom.


IBM Blue Gene BG/L

- Presently the fastest parallel computer.

- The Blue Gene/L at Lawrence Livermore National Laboratory has 16 racks (65'536 nodes, 131'072 processors): 280 TFlop/s.

- Simple, cheap processors (PPC440), moderate cycle time (700 MHz), high performance per Watt, small main memory (512 MB/node).

- 5 networks:
  - 3D torus for point-to-point messages (bandwidth 1.4 Gb/s, latency < 6.4 µs)
  - broadcast network for global communication, in particular reduction operations (bandwidth 2.8 Gb/s, latency 5 µs)
  - barrier network for global synchronization (latency 1.5 µs)
  - control network for checking system components (temperature, fans, ...)
  - Gb-Ethernet connecting the I/O nodes with external data storage.


Figure: IBM Blue Gene BG/L node architecture: two PPC 440 cores, each with 32K/32K L1 caches and a Double-Hummer FPU, are attached via L2 prefetch buffers and a multi-port shared SRAM buffer to a 4 MB embedded shared L3 cache (with an embedded shared L3 directory for DRAM with Error Correction Control, ECC); the memory controller and the torus, broadcast, barrier, control, and Gbit Ethernet networks hang off the processor bus (torus: 6 out, 6 in at 1.4 GB/s each; broadcast network: 3 out, 3 in at 2.8 GB/s each). [Rauber & Rünger: Parallele Programmierung]


IBM Blue Gene BG/L at Lawrence Livermore NL
