Technische Universität München
High Performance Computing – Programming Paradigms and Scalability
Part 1: Introduction
PD Dr. rer. nat. habil. Ralf-Peter Mundani
Computation in Engineering (CiE)
Scientific Computing (SCCS)
Summer Term 2015
General Remarks
Ralf-Peter Mundani
– email: [email protected], phone: 289–25057, room: 3181
– consultation-hour: by appointment
– lecture: Tuesday, 12:00—13:30, room 02.07.023
Christoph Riesinger
– email: [email protected]
– exercise: Wednesday, 10:15—11:45, room 02.07.023 (fortnightly)
examination
– written, 90 minutes
– all printed/written materials allowed (no electronic devices)
materials: http://www5.in.tum.de
General Remarks
content
– part 1: introduction
– part 2: high-performance networks
– part 3: foundations
– part 4: shared-memory programming
– part 5: distributed-memory programming
– part 6: examples of parallel algorithms
Overview
– motivation
– hardware excursion
– supercomputers
– classification of parallel computers
– quantitative performance evaluation
If one ox could not do the job they did not try to grow a bigger ox, but used two oxen.
—Grace Murray Hopper
Motivation
numerical simulation: from phenomena to predictions
physical phenomenon / technical process
1. modelling: determination of parameters, expression of relations
2. numerical treatment: model discretisation, algorithm development
3. implementation: software development, parallelisation
4. visualisation: illustration of abstract simulation results
5. validation: comparison of results with reality
6. embedding: insertion into working process
[figure: the steps assigned to the disciplines involved – mathematics, computer science, and the application discipline]
Motivation
why numerical simulation?
– because experiments are sometimes impossible: life cycle of galaxies, weather forecast, terror attacks, e.g.
– because experiments are sometimes not welcome: avalanches, nuclear tests, medicine, e.g.
[picture: bomb attack on WTC (1993)]
Motivation
why numerical simulation? (cont’d)
– because experiments are sometimes very costly & time consuming: protein folding, material sciences, e.g.
– because experiments are sometimes more expensive than a simulation: aerodynamics, crash test, e.g.
[picture: Mississippi basin model (Jackson, MS)]
Motivation
why parallel programming and HPC?
– complex problems (especially the so-called “grand challenges”) demand more computing power
  – climate or geophysics simulation (tsunami, e.g.)
  – structure or flow simulation (crash test, e.g.)
  – development systems (CAD, e.g.)
  – large data analysis (Large Hadron Collider at CERN, e.g.)
  – military applications (crypto analysis, e.g.)
– performance increase due to
  – faster hardware, more memory (“work harder”)
  – more efficient algorithms, optimisation (“work smarter”)
  – parallel computing (“get some help”)
Motivation
objectives (in case all resources would be available N-times)
– throughput: compute N problems simultaneously
  – running N instances of a sequential program with different data sets (“embarrassing parallelism”); SETI@home, e.g.
  – drawback: limited resources of single nodes
– response time: compute one problem at a fraction (1/N) of time
  – running one instance (i.e. N processes) of a parallel program for jointly solving a problem; finding prime numbers, e.g.
  – drawback: writing a parallel program; communication
– problem size: compute one problem with N-times larger data
  – running one instance (i.e. N processes) of a parallel program, using the sum of all local memories for computing larger problem sizes; iterative solution of SLE, e.g.
  – drawback: writing a parallel program; communication
Motivation
levels of parallelism
qualitative meaning: level(s) on which work is done in parallel
(ordered by increasing granularity)
– sub-instruction level
– instruction level
– block level
– process level
– program level
Motivation
levels of parallelism (cont’d)
program level
– parallel processing of different programs
– independent units without any shared data
– organised by the OS
process level
– a program is subdivided into processes to be executed in parallel
– each process consists of a larger amount of sequential instructions and some private data
– communication in most cases necessary (data exchange, e.g.)
– term of process often referred to as heavy-weight process
Motivation
levels of parallelism (cont’d)
block level
– blocks of instructions are executed in parallel
– each block consists of few instructions and shares data with others
– communication via shared variables; synchronisation mechanisms
– term of block often referred to as light-weight process (thread) – see the sketch after this list
instruction level
– parallel execution of machine instructions
– optimising compilers can increase this potential by modifying the order of commands
sub-instruction level
– instructions are further subdivided in units to be executed in parallel or via overlapping (vector operations, e.g.)
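Block-level (thread) parallelism is the subject of part 4; as a preview, a minimal C sketch (illustrative only, not from the slides; assumes an OpenMP-capable compiler, e.g. gcc -fopenmp) in which a loop is split into blocks executed by light-weight threads that synchronise via a shared variable:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;                     /* shared variable */

    /* block level: the loop iterations are split into blocks,
       one per thread; the reduction synchronises the partial sums */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += 1.0 / ((double)i * (double)i);

    printf("sum = %f (computed by up to %d threads)\n",
           sum, omp_get_max_threads());
    return 0;
}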
Overview
– motivation
– hardware excursion
– supercomputers
– classification of parallel computers
– quantitative performance evaluation
Hardware Excursion
definition of parallel computers
“A collection of processing elements that communicate and cooperate to solve large problems” (ALMASI and GOTTLIEB, 1989)
possible appearances of such processing elements
– specialised units (steps of a vector pipeline, e.g.)
– parallel features in modern monoprocessors (instruction pipelining, superscalar architectures, VLIW, multithreading, multicore, …)
– several uniform arithmetical units (processing elements of array computers, GPGPUs, accelerators, e.g.)
– complete stand-alone computers connected via LAN (workstation or PC clusters, so-called virtual parallel computers)
– parallel computers or clusters connected via WAN (so-called metacomputers)
Hardware Excursion
instruction pipelining
instruction execution involves several operations
1. instruction fetch (IF)
2. decode (DE)
3. fetch operands (OP)
4. execute (EX)
5. write back (WB)
which are executed successively
hence, only one part of the CPU works at a given moment
[figure: instruction N and instruction N+1 pass through IF, DE, OP, EX, WB strictly one after another]
Hardware Excursion
instruction pipelining (cont’d)
– observation: while processing a particular stage of an instruction, the other stages are idle
– hence, multiple instructions are overlapped in execution → instruction pipelining (similar to assembly lines)
– advantage: no additional hardware necessary
[figure: instructions N to N+4 staggered in time, each starting one stage (IF DE OP EX WB) after its predecessor]
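The gain is easy to quantify (a back-of-the-envelope count, not from the slides, assuming one cycle per stage and no pipeline hazards): with k = 5 stages, N instructions take
T_seq = 5∙N cycles without pipelining, but only T_pipe = 5 + (N − 1) cycles with it,
so the speed-up 5∙N / (5 + N − 1) approaches the number of stages k for large N.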
Hardware Excursion
superscalar
– faster CPU throughput due to simultaneous execution of instructions within one clock cycle via redundant functional units (ALU, multiplier, …)
– a dispatcher decides (during runtime) which instructions read from memory can be executed in parallel and dispatches them to different functional units
– for instance, PowerPC 970 (4 ALU, 2 FPU)
– but performance improvement is limited (intrinsic parallelism)
[figure: instructions 1–4 dispatched to four ALUs and instructions A–B to two FPUs within the same clock cycle]
Hardware Excursion
superscalar (cont’d)
pipelining for superscalar architectures also possible
[figure: two instructions entering the pipeline per cycle; instructions N to N+9 overlapped in time (IF DE OP EX WB)]
Hardware Excursion
very long instruction word (VLIW)
– in contrast to superscalar architectures, the compiler groups parallel executable instructions during compilation (pipelining still possible)
– advantage: no additional hardware logic necessary
– drawback: not always fully useable (→ dummy filling (NOP))
[figure: one VLIW instruction bundling instructions 1–4, operating on a common set of registers]
Hardware Excursion
vector units
– simultaneous execution of one instruction on a one-dimensional array of data (→ vector)
– VU first appeared in the 1970s and were the basis of most supercomputers in the 1980s and 1990s
– specialised hardware → very expensive
– limited application areas (mostly CFD, CSD, …)
[figure: one instruction applied to all elements 1, 2, 3, …, N−1, N at once:
 (A1+B1, A2+B2, A3+B3, …, AN−1+BN−1, AN+BN)^T = (C1, C2, C3, …, CN−1, CN)^T]
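What a vector unit does in hardware can be mimicked in source code; a minimal C sketch (illustrative, not from the slides) of the element-wise operation shown above, written so that an auto-vectorising compiler may map the loop onto SIMD/vector instructions:

#include <stddef.h>

/* c[i] = a[i] + b[i]: one operation applied to a whole array of data;
   with optimisation enabled most compilers auto-vectorise this loop */
void vector_add(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}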
Hardware Excursion
dual core, quad core, many core, and multicore
– observation: increasing frequency f (and thus core voltage v) over past years
– problem: thermal power dissipation P ∝ f∙v²
Hardware Excursion
dual core, quad core, many core, and multicore (cont’d)
a 25% reduction in performance (i.e. core voltage) leads to an approx. 50% reduction in dissipation
[figure: dissipation and performance of a normal CPU vs. a reduced (lower-voltage) CPU]
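A rough sanity check of this figure (my arithmetic, assuming performance and clock frequency scale linearly with the core voltage v): reducing v by 25% gives P ∝ f∙v² = 0.75 ∙ 0.75² ≈ 0.42 of the original dissipation, i.e. roughly the stated 50% reduction.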
Hardware Excursion
dual core, quad core, many core, and multicore (cont’d)
idea: installation of two cores per die with the same dissipation as a single core system
[figure: dissipation and performance of a single core die vs. a dual core die]
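Continuing the estimate from the previous slide (same assumptions, my arithmetic): two reduced cores dissipate about 2 ∙ 0.42 ≈ 0.84 of the single-core budget while offering up to 2 ∙ 0.75 = 1.5 times the original performance, which is the rationale behind multicore designs.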
Hardware Excursion
dual core, quad core, many core, and multicore (cont’d)
single vs. dual vs. quad core
[figure: three cache hierarchies –
 single core: core 0 with L1 and L2 cache, attached to the FSB;
 dual core: cores 0–1 with private L1 caches and a shared L2 cache, attached to the FSB;
 quad core: cores 0–3 with private L1 caches and a shared L2 cache per core pair, attached to the FSB]
FSB: front side bus (i.e. connection to memory (via north bridge))
Hardware Excursion
INTEL Nehalem Core i7
[figure: cores 0–3, each with private L1 and L2 caches, a shared L3 cache, and a QPI link; source: www.samrathacks.com]
QPI: QuickPath Interconnect, replaces the FSB (QPI is a point-to-point interconnection – with a memory controller now on-die – in order to allow both reduced latency and higher bandwidth of up to (theoretically) 25.6 GByte/s data transfer, i.e. about twice the FSB)
Hardware Excursion
Intel E5-2600 Sandy-Bridge series
– 2 CPUs connected by 2 QPIs (Intel QuickPath Interconnect)
– QuickPath Interconnect (1 sending and 1 receiving port):
  8 GT/s ∙ 16 bit/T payload ∙ 2 directions / 8 bit/byte = 32 GB/s max bandwidth per QPI
– 2 QPI links: 2 ∙ 32 GB/s = 64 GB/s max bandwidth
source: G. Wellein, RRZE
Overview
– motivation
– hardware excursion
– supercomputers
– classification of parallel computers
– quantitative performance evaluation
Supercomputers
arrival of clusters
– in the late eighties, PCs became a commodity market with rapidly increasing performance, mass production, and decreasing prices
– growing attractiveness for parallel computers
– 1994: Beowulf, the first parallel computer built completely out of commodity hardware
  – NASA Goddard Space Flight Centre
  – 16 Intel DX4 processors
  – multiple 10 Mbit Ethernet links
  – Linux with GNU compilers
  – MPI library
– 1996: Beowulf cluster performing more than 1 GFlops
– 1997: a 140-node cluster performing more than 10 GFlops
Supercomputers
supercomputers
– supercomputing or high-performance scientific computing as the most important application of the big number crunchers
– national initiatives due to huge budget requirements
  – Accelerated Strategic Computing Initiative (ASCI) in the U.S.
    – in the sequel of the nuclear testing moratorium in 1992/93
    – decision: develop, build, and install a series of five supercomputers of up to $100 million each in the U.S.
    – start: ASCI Red (1997, Intel-based, Sandia National Laboratory, the world’s first TFlops computer)
    – then: ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI White, …
  – meanwhile new high-end computing memorandum (2004)
Supercomputers
supercomputers (cont’d)
– federal “Bundeshöchstleistungsrechner” initiative in Germany
  – decision in the mid-nineties
  – three federal supercomputing centres in Germany (Munich, Stuttgart, and Jülich)
  – one new installation every second year (i.e. a six-year upgrade cycle for each centre)
  – the newest one to be among the top 10 of the world
– overview and state of the art: Top500 list (updated every six months), see http://www.top500.org
– finally (a somewhat different definition)
“Supercomputer: Turns CPU-bound problems into I/O-bound problems.”
—Ken Batcher
Supercomputers
MOORE’s law
– observation of Intel co-founder Gordon E. MOORE, describes an important trend in the history of computer hardware (1965)
– the number of transistors that can be placed on an integrated circuit is increasing exponentially, doubling approximately every eighteen months
Supercomputers
some numbers: Top500
[charts: performance statistics from the Top500 list; motto: Citius, altius, fortius!]
Supercomputers
the 10 fastest supercomputers in the world (by November 2014)
Supercomputers
The Earth Simulator – world’s #1 from 2002—04
– installed in 2002 in Yokohama, Japan; ES building (approx. 50 m × 65 m × 17 m)
– based on NEC SX-6 architecture, developed by three governmental agencies
– highly parallel vector supercomputer
– consists of 640 nodes (plus 2 control & 128 data-switching nodes)
  – 8 vector processors (8 GFlops each) and 16 GB shared memory per node
  – 5120 processors (40.96 TFlops peak performance) and 10 TB memory; 35.86 TFlops sustained performance (Linpack)
– nodes connected by a 640 × 640 single-stage crossbar (83,200 cables with a total extension of 2,400 km; 8 TBps total bandwidth)
– further 700 TB disc space and 1.60 PB mass storage
Supercomputers
BlueGene/L – world’s #1 from 2004—08
– installed in 2005 at LLNL, CA, USA (beta system in 2004 at IBM)
– cooperation of DoE, LLNL, and IBM
– massively parallel supercomputer
– consists of 65,536 nodes (plus 12 front-end and 1,024 I/O nodes)
  – 2 PowerPC 440d processors (2.8 GFlops each) and 512 MB memory per node
  – 131,072 processors (367.00 TFlops peak performance) and 33.50 TB memory; 280.60 TFlops sustained performance (Linpack)
– nodes configured as 3D torus (32 × 32 × 64); global reduction tree for fast operations (global max / sum) in a few microseconds
– 1,024 Gbps link to global parallel file system
– further 806 TB disc space; operating system SuSE SLES 9
Supercomputers
Roadrunner – world’s #1 from 2008—09
– installed in 2008 at LANL, NM, USA
– installation costs about $120 million
– first “hybrid” supercomputer: dual-core Opteron + Cell Broadband Engine
– 129,600 cores (1,456.70 TFlops peak performance) and 98 TB memory; 1,144.00 TFlops sustained performance (Linpack)
– standard processing (file system I/O, e.g.) handled by Opteron, while mathematically and CPU-intensive tasks are handled by Cell
– 2.35 MW power consumption (≈ 437 MFlops per watt)
– primary usage: ensure safety and reliability of the nation’s nuclear weapons stockpile; real-time applications (cause & effect in capital markets, renderings of bone structures and tissues as patients are being examined, e.g.)
Supercomputers
HLRB II (world’s #6 for 04/2006)
– installed in 2006 at LRZ, Garching; installation costs 38 M€, monthly costs approx. 400,000 €
– upgrade in 2007 (finished); one of Germany’s 3 supercomputers
– SGI Altix 4700
– consists of 19 nodes (SGI NUMA link 2D torus)
  – 256 blades per node (ccNUMA link with partition fat tree)
  – Intel Itanium2 Montecito dual core (12.80 GFlops), 4 GB memory per core
– 9,728 cores (62.30 TFlops peak performance) and 39 TB memory; 56.50 TFlops sustained performance (Linpack)
– footprint 24 m × 12 m; total weight 103 metric tons
Supercomputers
SuperMUC (world’s #4 for 06/2012)
– installed in 2012 at LRZ, Garching
– IBM System x iDataPlex
– (still) one of Germany’s 3 supercomputers
– consists of 19 islands (Infiniband FDR10 pruned tree with 4:1 intra-island / inter-island ratio)
  – 18 thin islands with 512 nodes each (total 288 TB memory): Sandy Bridge-EP Xeon E5 (2 CPUs (8 cores each) per node)
  – 1 fat island with 205 nodes (total 52 TB memory): Westmere-EX Xeon E7 (4 CPUs (10 cores each) per node)
– 147,456 cores (3.185 PFlops peak performance – thin islands only); 2.897 PFlops sustained performance (Linpack)
– footprint 21 m × 26 m; warm water cooling
Overview
– motivation
– hardware excursion
– supercomputers
– classification of parallel computers
– quantitative performance evaluation
Classification of Parallel Computers
standard classification according to FLYNN
– global data and instruction streams as criterion
  – instruction stream: sequence of commands to be executed
  – data stream: sequence of data subject to instruction streams
– two-dimensional subdivision according to
  – amount of instructions per time a computer can execute
  – amount of data elements per time a computer can process
– hence, FLYNN distinguishes four classes of architectures
  – SISD: single instruction, single data
  – SIMD: single instruction, multiple data
  – MISD: multiple instruction, single data
  – MIMD: multiple instruction, multiple data
– drawback: very different computers may belong to the same class
Classification of Parallel Computers
standard classification according to FLYNN (cont’d)
SISD
– one processing unit that has access to one data memory and to one program memory
– classical monoprocessor following VON NEUMANN’s principle
[figure: processor connected to one data memory and one program memory]
Classification of Parallel Computers
standard classification according to FLYNN (cont’d)
SIMD
– several processing units, each with separate access to a (shared or distributed) data memory; one program memory
– synchronous execution of instructions
– example: array computer, vector computer
– advantages: easy programming model due to control flow with a strict synchronous-parallel execution of all instructions
– drawbacks: specialised hardware necessary, easily becomes outdated due to recent developments at the commodity market
[figure: several processors, each with its own data memory, driven by one program memory]
Classification of Parallel Computers
standard classification according to FLYNN (cont’d)
MISD
– several processing units that have access to one data memory; several program memories
– not very popular class (mainly for special applications such as Digital Signal Processing)
– operating on a single stream of data, forwarding results from one processing unit to the next
– example: systolic array (network of primitive processing elements that “pump” data)
[figure: processors sharing one data memory, each with its own program memory]
Classification of Parallel Computers
standard classification according to FLYNN (cont’d)
MIMD
– several processing units, each with separate access to a (shared or distributed) data memory; several program memories
– classification according to (physical) memory organisation
  – shared memory → shared (global) address space
  – distributed memory → distributed (local) address space
– example: multiprocessor systems, networks of computers
[figure: several processors, each with its own data memory and program memory]
Classification of Parallel Computers
processor coupling
– cooperation of processors / computers as well as their shared use of various resources require communication and synchronisation
– the following types of processor coupling can be distinguished
  – memory-coupled multiprocessor systems (MemMS)
  – message-coupled multiprocessor systems (MesMS)
[table:
                              global memory    distributed memory
  shared address space        MemMS, SMP       Mem-MesMS (hybrid)
  distributed address space   —                MesMS]
Classification of Parallel Computers
processor coupling (cont’d)
uniform memory access (UMA)
– each processor P has direct access via the network to each memory module M with same access times to all data
– standard programming model can be used (i.e. no explicit send / receive of messages necessary)
– communication and synchronisation via shared variables (inconsistencies (write conflicts, e.g.) have to be prevented in general by the programmer)
[figure: processors P connected via a network to memory modules M]
Classification of Parallel Computers
processor coupling (cont’d)
symmetric multiprocessor (SMP)
– only a small amount of processors, in most cases a central bus, one address space (UMA), but bad scalability
– cache coherence implemented in hardware (i.e. a read always provides a variable’s value from its last write)
– example: double or quad boards, SGI Challenge
[figure: processors P, each with a cache C, connected via a bus to one memory M]
Classification of Parallel Computers
processor coupling (cont’d)
non-uniform memory access (NUMA)
– memory modules physically distributed among processors
– shared address space, but access times depend on location of data (i.e. local addresses faster than remote addresses)
– differences in access times are visible in the program
– example: DSM / VSM, Cray T3E
[figure: processors P with local memories M, connected via a network]
Classification of Parallel Computers
processor coupling (cont’d)
cache-coherent non-uniform memory access (ccNUMA)
– caches for local and remote addresses; cache coherence implemented in hardware for the entire address space
– problem with scalability due to frequent cache actualisations
– example: SGI Origin 2000
[figure: processors P with caches C and local memories M, connected via a network]
Classification of Parallel Computers
processor coupling (cont’d)
cache-only memory access (COMA)
– each processor has only cache memory
– entirety of all cache memories = global shared memory
– cache coherence implemented in hardware
– example: Kendall Square Research KSR-1
[figure: processors P with caches C, connected via a network]
Classification of Parallel Computers
processor coupling (cont’d)
no remote memory access (NORMA)
– each processor has direct access to its local memory only
– access to remote memory only via explicit message exchange (due to distributed address space) possible
– synchronisation implicitly via the exchange of messages
– performance improvement between memory and I/O due to parallel data transfer (Direct Memory Access, e.g.) possible
– example: IBM SP2, ASCI Red / Blue / White
[figure: processors P with local memories M, connected via a network]
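Message-coupled (NORMA-style) programming is covered in part 5; as a preview, a minimal MPI sketch in C (illustrative only, not from the slides; run with e.g. mpirun -np 2) showing the explicit message exchange such architectures require:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;   /* no shared memory: data travels only via messages */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}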
Classification of Parallel Computers
difference between processes and threads
[figure:
 process model (NORMA): several programs (*.exe, *.out, e.g.), each with its own address space, communicating via messages;
 thread model (UMA, NUMA): one program (*.exe, *.out, e.g.) whose threads share a common address space]
Overview
– motivation
– hardware excursion
– supercomputers
– classification of parallel computers
– quantitative performance evaluation
Quantitative Performance Evaluation
execution time
– time T of a parallel program between start of the execution on one processor and end of all computations on the last processor
– during execution all processors are in one of the following states
  – compute – TCOMP: time spent for computations
  – communicate – TCOMM: time spent for send and receive operations
  – idle – TIDLE: time spent for waiting (sending / receiving messages)
– hence T = TCOMP + TCOMM + TIDLE
Quantitative Performance Evaluation
comparison multiprocessor / monoprocessor
– correlation of multi- and monoprocessor systems’ performance
– important: program that can be executed on both systems
– definitions
  – P(1): amount of unit operations of a program on the monoprocessor system
  – P(p): amount of unit operations of a program on the multiprocessor system with p processors
  – T(1): execution time of a program on the monoprocessor system (measured in steps or clock cycles)
  – T(p): execution time of a program on the multiprocessor system (measured in steps or clock cycles) with p processors
Quantitative Performance Evaluation
comparison multiprocessor / monoprocessor (cont’d)
simplifying preconditions
– T(1) = P(1): one operation to be executed in one step on the monoprocessor system
– T(p) ≤ P(p): more than one operation can be executed in one step (for p ≥ 2) on the multiprocessor system with p processors
Quantitative Performance Evaluation
comparison multiprocessor / monoprocessor (cont’d)
– speed-up S(p) indicates the improvement in processing speed
  S(p) = T(1) / T(p)   with 1 ≤ S(p) ≤ p
– efficiency E(p) indicates the relative improvement in processing speed; the improvement is normalised by the amount of processors p
  E(p) = S(p) / p   with 1/p ≤ E(p) ≤ 1
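In practice both quantities are computed from measured execution times; a minimal C sketch (illustrative, with made-up timings):

#include <stdio.h>

/* speed-up and efficiency from measured times T(1) and T(p) */
static void report(double t1, double tp, int p)
{
    double s = t1 / tp;    /* S(p) = T(1) / T(p) */
    double e = s / p;      /* E(p) = S(p) / p    */
    printf("p = %2d: S(p) = %5.2f, E(p) = %4.2f\n", p, s, e);
}

int main(void)
{
    report(120.0, 64.0, 2);    /* example timings, made up */
    report(120.0, 35.0, 4);
    report(120.0, 20.0, 8);
    return 0;
}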
Quantitative Performance Evaluation
comparison multiprocessor / monoprocessor (cont’d)
speed-up and efficiency can be seen in two different ways
– algorithm-independent: the best known sequential algorithm for the monoprocessor system is compared to the respective parallel algorithm for the multiprocessor system → absolute speed-up / absolute efficiency
– algorithm-dependent: the parallel algorithm is treated as a sequential one to measure the execution time on the monoprocessor system; “unfair” due to communication and synchronisation overhead → relative speed-up / relative efficiency
Quantitative Performance Evaluation
scalability
– objective: adding further processing elements to the system shall reduce the execution time without any program modifications, i.e. a linear performance increase with an efficiency close to 1
– important for the scalability is a sufficient problem size
  – one porter may carry one suitcase in a minute
  – 60 porters won’t do it in a second
  – but 60 porters may carry 60 suitcases in a minute
– in case of a fixed problem size and an increasing amount of processors, saturation will occur for a certain value of p; hence scalability is limited
– when scaling the amount of processors together with the problem size (so-called scaled problem analysis) this effect will not appear for well-scalable hard- and software systems
Quantitative Performance Evaluation
AMDAHL’s law
– the probably most important and most famous estimate for the speed-up (even if quite pessimistic)
– underlying model
  – each program has a sequential part s, 0 ≤ s ≤ 1, that can only be executed in a sequential way: synchronisation, data I/O, …
  – furthermore, each program consists of a parallelisable part 1−s that can be executed in parallel by several processes; finding the maximum value within a set of numbers, e.g.
– hence, the execution time for the parallel program executed on p processors can be written as
  T(p) = s∙T(1) + ((1−s)/p)∙T(1)
Quantitative Performance Evaluation
AMDAHL’s law (cont’d)
– the speed-up can thus be computed as
  S(p) = T(1) / T(p) = T(1) / (s∙T(1) + ((1−s)/p)∙T(1)) = 1 / (s + (1−s)/p)
– when increasing p we finally get AMDAHL’s law
  lim (p→∞) S(p) = lim (p→∞) 1 / (s + (1−s)/p) = 1/s
– speed-up is bounded: S(p) ≤ 1/s
– the sequential part can have a dramatic impact on the speed-up
– therefore, central effort of all (parallel) algorithms: keep s small
– many parallel programs have a small sequential part (s < 0.1)
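Plugging in the value used on the next slide (s = 0.1, my arithmetic): S(10) = 1/(0.1 + 0.9/10) ≈ 5.3 and S(100) = 1/(0.1 + 0.9/100) ≈ 9.2, and no number of processors pushes S(p) beyond 1/s = 10.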
Quantitative Performance Evaluation
AMDAHL’s law (cont’d)
example: s = 0.1
[graph: speed-up S(p) over the number of processes p = 0 … 100; the curve flattens out and approaches the bound 1/s = 10]
– independent from p, the speed-up is bounded by this limit
– where’s the error?
Quantitative Performance Evaluation
GUSTAFSON’s law
– addresses the shortcomings of AMDAHL’s law as it states that any sufficiently large problem can be efficiently parallelised
– instead of a fixed problem size it supposes a fixed time concept
– underlying model
  – execution time on the parallel machine is normalised to 1
  – this contains a non-parallelisable part σ, 0 ≤ σ ≤ 1
– hence, the execution time for the sequential program on the monoprocessor can be written as
  T(1) = σ + p∙(1−σ)
– the speed-up can thus be computed as
  S(p) = T(1) / T(p) = σ + p∙(1−σ) = p + σ∙(1−p)
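For comparison with the AMDAHL example (my arithmetic): with σ = 0.1 and p = 100, S(100) = 0.1 + 100 ∙ 0.9 = 90.1, instead of the ≈ 9.2 that AMDAHL’s law predicts for a fixed problem size.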
Quantitative Performance Evaluation
GUSTAFSON’s law (cont’d)
– difference to AMDAHL: the sequential part s(p) is not constant, but gets smaller with increasing p
  s(p) = σ / (σ + p∙(1−σ)) → 0 for p → ∞
– often more realistic, because more processors are used for a larger problem size, and here parallelisable parts typically increase (more computations, less declarations, …)
– speed-up is not bounded for increasing p
Quantitative Performance Evaluation
GUSTAFSON’s law (cont’d)
some more thoughts about speed-up
– theory tells: a superlinear speed-up does not exist
  – each parallel algorithm can be simulated on a monoprocessor system by emulating in a loop always the next step of a processor from the multiprocessor system
– but superlinear speed-up can be observed
  – when improving an inferior sequential algorithm
  – when a parallel program (that does not fit into the main memory of the monoprocessor system) completely runs in cache and main memory of the nodes of the multiprocessor system
Quantitative Performance Evaluation
communication—computation ratio (CCR)
– important quantity measuring the success of a parallelisation
– relation of pure communication time to pure computing time
– a small CCR is favourable; typically the CCR decreases with increasing problem size
– example
  – N×N matrix distributed among p processors (N/p rows each)
  – iterative method: in each step, each matrix element is replaced by the average of its eight neighbour values
  – hence, the two neighbouring rows are always necessary
  – computation time: 8∙N∙N/p
  – communication time: 2∙N
  – CCR: p/(4∙N) – what does this mean?
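The stated ratio follows directly from the two times (a quick check): CCR = 2∙N / (8∙N∙N/p) = p/(4∙N). It means that adding processors (larger p) worsens the ratio, while enlarging the problem (larger N) improves it, exactly the behaviour claimed above.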
Twelve ways…
…to fool the masses when giving performance results on parallel computers.
—David H. Bailey, NASA Ames Research Centre, 1991
1. Quote only 32-bit performance results, not 64-bit results.
2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs.
4. Scale up the problem size with the number of processors, but omit any mention of this fact.
5. Quote performance results projected to a full system.
6. Compare your results against scalar, unoptimised codes on Crays.
Twelve ways…
7. When direct run time comparisons are required, compare with an old code on an obsolete system.
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
9. Quote performance in terms of processor utilisation, parallel speed-ups or MFLOPS per dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
12. If all else fails, show pretty pictures and animated videos, and don’t talk about performance.