Technische Universität München
High Performance Computing – Programming Paradigms and Scalability
Part 1: Introduction
PD Dr. rer. nat. habil. Ralf-Peter Mundani
Computation in Engineering (CiE)
Scientific Computing (SCCS)
Summer Term 2015
General Remarks
Ralf-Peter Mundani
– email: [email protected], phone: 289–25057, room: 3181
– consultation-hour: by appointment
– lecture: Tuesday, 12:00—13:30, room 02.07.023
Christoph Riesinger
– email: [email protected]
– exercise: Wednesday, 10:15—11:45, room 02.07.023 (fortnightly)
examination
– written, 90 minutes
– all printed/written materials allowed (no electronic devices)
materials: http://www5.in.tum.de
General Remarks
content
– part 1: introduction
– part 2: high-performance networks
– part 3: foundations
– part 4: shared-memory programming
– part 5: distributed-memory programming
– part 6: examples of parallel algorithms
Overview
– motivation
– hardware excursion
– supercomputers
– classification of parallel computers
– quantitative performance evaluation
If one ox could not do the job they did not try to grow a bigger ox, but used two oxen.
—Grace Murray Hopper
Motivation
numerical simulation: from phenomena to predictions
physical phenomenon / technical process
1. modelling: determination of parameters, expression of relations
2. numerical treatment: model discretisation, algorithm development
3. implementation: software development, parallelisation
4. visualisation: illustration of abstract simulation results
5. validation: comparison of results with reality
6. embedding: insertion into working process
[figure: the steps assigned to the disciplines involved – mathematics, computer science, and the application discipline]
Motivation
why numerical simulation?
– because experiments are sometimes impossible: life cycle of galaxies, weather forecast, terror attacks, e.g.
– because experiments are sometimes not welcome: avalanches, nuclear tests, medicine, e.g.
[picture: bomb attack on WTC (1993)]
Motivation
why numerical simulation? (cont’d)
– because experiments are sometimes very costly & time consuming: protein folding, material sciences, e.g.
– because experiments are sometimes more expensive than a simulation: aerodynamics, crash test, e.g.
[picture: Mississippi basin model (Jackson, MS)]
Motivation
why parallel programming and HPC?
– complex problems (especially the so-called “grand challenges”) demand more computing power
  – climate or geophysics simulation (tsunami, e.g.)
  – structure or flow simulation (crash test, e.g.)
  – development systems (CAD, e.g.)
  – large data analysis (Large Hadron Collider at CERN, e.g.)
  – military applications (crypto analysis, e.g.)
– performance increase due to
  – faster hardware, more memory (“work harder”)
  – more efficient algorithms, optimisation (“work smarter”)
  – parallel computing (“get some help”)
Motivation
objectives (in case all resources would be available N-times)
– throughput: compute N problems simultaneously
  – running N instances of a sequential program with different data sets (“embarrassing parallelism”); SETI@home, e.g.
  – drawback: limited resources of single nodes
– response time: compute one problem at a fraction (1/N) of time
  – running one instance (i.e. N processes) of a parallel program for jointly solving a problem; finding prime numbers, e.g.
  – drawback: writing a parallel program; communication
– problem size: compute one problem with N-times larger data
  – running one instance (i.e. N processes) of a parallel program, using the sum of all local memories for computing larger problem sizes; iterative solution of SLE, e.g.
  – drawback: writing a parallel program; communication
Motivation
levels of parallelism
qualitative meaning: level(s) on which work is done in parallel
(ordered by increasing granularity)
– sub-instruction level
– instruction level
– block level
– process level
– program level
Motivation
levels of parallelism (cont’d)
program level
– parallel processing of different programs
– independent units without any shared data
– organised by the OS
process level
– a program is subdivided into processes to be executed in parallel
– each process consists of a larger amount of sequential instructions and some private data
– communication in most cases necessary (data exchange, e.g.)
– term of process often referred to as heavy-weight process
Motivation
levels of parallelism (cont’d)
block level
– blocks of instructions are executed in parallel
– each block consists of few instructions and shares data with others
– communication via shared variables; synchronisation mechanisms
– term of block often referred to as light-weight process (thread) – see the sketch after this list
instruction level
– parallel execution of machine instructions
– optimising compilers can increase this potential by modifying the order of commands
sub-instruction level
– instructions are further subdivided in units to be executed in parallel or via overlapping (vector operations, e.g.)
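Block-level (thread) parallelism is the subject of part 4; as a preview, a minimal C sketch (illustrative only, not from the slides; assumes an OpenMP-capable compiler, e.g. gcc -fopenmp) in which a loop is split into blocks executed by light-weight threads that synchronise via a shared variable:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;                     /* shared variable */

    /* block level: the loop iterations are split into blocks,
       one per thread; the reduction synchronises the partial sums */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += 1.0 / ((double)i * (double)i);

    printf("sum = %f (computed by up to %d threads)\n",
           sum, omp_get_max_threads());
    return 0;
}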
Overview
– motivation
– hardware excursion
– supercomputers
– classification of parallel computers
– quantitative performance evaluation
Hardware Excursion
definition of parallel computers
“A collection of processing elements that communicate and cooperate to solve large problems” (ALMASI and GOTTLIEB, 1989)
possible appearances of such processing elements
– specialised units (steps of a vector pipeline, e.g.)
– parallel features in modern monoprocessors (instruction pipelining, superscalar architectures, VLIW, multithreading, multicore, …)
– several uniform arithmetical units (processing elements of array computers, GPGPUs, accelerators, e.g.)
– complete stand-alone computers connected via LAN (workstation or PC clusters, so-called virtual parallel computers)
– parallel computers or clusters connected via WAN (so-called metacomputers)
Hardware Excursion
instruction pipelining
instruction execution involves several operations
1. instruction fetch (IF)
2. decode (DE)
3. fetch operands (OP)
4. execute (EX)
5. write back (WB)
which are executed successively
hence, only one part of the CPU works at a given moment
[figure: instruction N and instruction N+1 pass through IF, DE, OP, EX, WB strictly one after another]
Hardware Excursion
instruction pipelining (cont’d)
– observation: while processing a particular stage of an instruction, the other stages are idle
– hence, multiple instructions are overlapped in execution → instruction pipelining (similar to assembly lines)
– advantage: no additional hardware necessary
[figure: instructions N to N+4 staggered in time, each starting one stage (IF DE OP EX WB) after its predecessor]
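The gain is easy to quantify (a back-of-the-envelope count, not from the slides, assuming one cycle per stage and no pipeline hazards): with k = 5 stages, N instructions take
T_seq = 5∙N cycles without pipelining, but only T_pipe = 5 + (N − 1) cycles with it,
so the speed-up 5∙N / (5 + N − 1) approaches the number of stages k for large N.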
Hardware Excursion
superscalar
– faster CPU throughput due to simultaneous execution of instructions within one clock cycle via redundant functional units (ALU, multiplier, …)
– a dispatcher decides (during runtime) which instructions read from memory can be executed in parallel and dispatches them to different functional units
– for instance, PowerPC 970 (4 ALU, 2 FPU)
– but performance improvement is limited (intrinsic parallelism)
[figure: instructions 1–4 dispatched to four ALUs and instructions A–B to two FPUs within the same clock cycle]
Hardware Excursion
superscalar (cont’d)
pipelining for superscalar architectures also possible
[figure: two instructions entering the pipeline per cycle; instructions N to N+9 overlapped in time (IF DE OP EX WB)]
Hardware Excursion
very long instruction word (VLIW)
– in contrast to superscalar architectures, the compiler groups parallel executable instructions during compilation (pipelining still possible)
– advantage: no additional hardware logic necessary
– drawback: not always fully useable (→ dummy filling (NOP))
[figure: one VLIW instruction bundling instructions 1–4, operating on a common set of registers]
Hardware Excursion
vector units
– simultaneous execution of one instruction on a one-dimensional array of data (→ vector)
– VU first appeared in the 1970s and were the basis of most supercomputers in the 1980s and 1990s
– specialised hardware → very expensive
– limited application areas (mostly CFD, CSD, …)
[figure: one instruction applied to all elements 1, 2, 3, …, N−1, N at once:
 (A1+B1, A2+B2, A3+B3, …, AN−1+BN−1, AN+BN)^T = (C1, C2, C3, …, CN−1, CN)^T]
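What a vector unit does in hardware can be mimicked in source code; a minimal C sketch (illustrative, not from the slides) of the element-wise operation shown above, written so that an auto-vectorising compiler may map the loop onto SIMD/vector instructions:

#include <stddef.h>

/* c[i] = a[i] + b[i]: one operation applied to a whole array of data;
   with optimisation enabled most compilers auto-vectorise this loop */
void vector_add(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}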
Hardware Excursion
dual core, quad core, many core, and multicore
– observation: increasing frequency f (and thus core voltage v) over past years
– problem: thermal power dissipation P ∝ f∙v²
Hardware Excursion
dual core, quad core, many core, and multicore (cont’d)
a 25% reduction in performance (i.e. core voltage) leads to an approx. 50% reduction in dissipation
[figure: dissipation and performance of a normal CPU vs. a reduced (lower-voltage) CPU]
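A rough sanity check of this figure (my arithmetic, assuming performance and clock frequency scale linearly with the core voltage v): reducing v by 25% gives P ∝ f∙v² = 0.75 ∙ 0.75² ≈ 0.42 of the original dissipation, i.e. roughly the stated 50% reduction.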
Hardware Excursion
dual core, quad core, many core, and multicore (cont’d)
idea: installation of two cores per die with the same dissipation as a single core system
[figure: dissipation and performance of a single core die vs. a dual core die]
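Continuing the estimate from the previous slide (same assumptions, my arithmetic): two reduced cores dissipate about 2 ∙ 0.42 ≈ 0.84 of the single-core budget while offering up to 2 ∙ 0.75 = 1.5 times the original performance, which is the rationale behind multicore designs.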
Hardware Excursion
dual core, quad core, many core, and multicore (cont’d)
single vs. dual vs. quad core
[figure: three cache hierarchies –
 single core: core 0 with L1 and L2 cache, attached to the FSB;
 dual core: cores 0–1 with private L1 caches and a shared L2 cache, attached to the FSB;
 quad core: cores 0–3 with private L1 caches and a shared L2 cache per core pair, attached to the FSB]
FSB: front side bus (i.e. connection to memory (via north bridge))
Hardware Excursion
INTEL Nehalem Core i7
[figure: cores 0–3, each with private L1 and L2 caches, a shared L3 cache, and a QPI link; source: www.samrathacks.com]
QPI: QuickPath Interconnect, replaces the FSB (QPI is a point-to-point interconnection – with a memory controller now on-die – in order to allow both reduced latency and higher bandwidth of up to (theoretically) 25.6 GByte/s data transfer, i.e. about twice the FSB)
Hardware Excursion
Intel E5-2600 Sandy-Bridge series
– 2 CPUs connected by 2 QPIs (Intel QuickPath Interconnect)
– QuickPath Interconnect (1 sending and 1 receiving port):
  8 GT/s ∙ 16 bit/T payload ∙ 2 directions / 8 bit/byte = 32 GB/s max bandwidth per QPI
– 2 QPI links: 2 ∙ 32 GB/s = 64 GB/s max bandwidth
source: G. Wellein, RRZE
Overview
– motivation
– hardware excursion
– supercomputers
– classification of parallel computers
– quantitative performance evaluation
Supercomputers
arrival of clusters
– in the late eighties, PCs became a commodity market with rapidly increasing performance, mass production, and decreasing prices
– growing attractiveness for parallel computers
– 1994: Beowulf, the first parallel computer built completely out of commodity hardware
  – NASA Goddard Space Flight Centre
  – 16 Intel DX4 processors
  – multiple 10 Mbit Ethernet links
  – Linux with GNU compilers
  – MPI library
– 1996: Beowulf cluster performing more than 1 GFlops
– 1997: a 140-node cluster performing more than 10 GFlops
Supercomputers
supercomputers
– supercomputing or high-performance scientific computing as the most important application of the big number crunchers
– national initiatives due to huge budget requirements
  – Accelerated Strategic Computing Initiative (ASCI) in the U.S.
    – in the sequel of the nuclear testing moratorium in 1992/93
    – decision: develop, build, and install a series of five supercomputers of up to $100 million each in the U.S.
    – start: ASCI Red (1997, Intel-based, Sandia National Laboratory, the world’s first TFlops computer)
    – then: ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI White, …
  – meanwhile new high-end computing memorandum (2004)
Supercomputers
supercomputers (cont’d)
– federal “Bundeshöchstleistungsrechner” initiative in Germany
  – decision in the mid-nineties
  – three federal supercomputing centres in Germany (Munich, Stuttgart, and Jülich)
  – one new installation every second year (i.e. a six-year upgrade cycle for each centre)
  – the newest one to be among the top 10 of the world
– overview and state of the art: Top500 list (updated every six months), see http://www.top500.org
– finally (a somewhat different definition)
“Supercomputer: Turns CPU-bound problems into I/O-bound problems.”
—Ken Batcher
Supercomputers
MOORE’s law
– observation of Intel co-founder Gordon E. MOORE, describes an important trend in the history of computer hardware (1965)
– the number of transistors that can be placed on an integrated circuit is increasing exponentially, doubling approximately every eighteen months
Supercomputers
some numbers: Top500
[charts: performance statistics from the Top500 list; motto: Citius, altius, fortius!]
Supercomputers
the 10 fastest supercomputers in the world (by November 2014)
Supercomputers
The Earth Simulator – world’s #1 from 2002—04
– installed in 2002 in Yokohama, Japan; ES building (approx. 50 m × 65 m × 17 m)
– based on NEC SX-6 architecture, developed by three governmental agencies
– highly parallel vector supercomputer
– consists of 640 nodes (plus 2 control & 128 data-switching nodes)
  – 8 vector processors (8 GFlops each) and 16 GB shared memory per node
  – 5120 processors (40.96 TFlops peak performance) and 10 TB memory; 35.86 TFlops sustained performance (Linpack)
– nodes connected by a 640 × 640 single-stage crossbar (83,200 cables with a total extension of 2,400 km; 8 TBps total bandwidth)
– further 700 TB disc space and 1.60 PB mass storage
Supercomputers
BlueGene/L – world’s #1 from 2004—08
– installed in 2005 at LLNL, CA, USA (beta system in 2004 at IBM)
– cooperation of DoE, LLNL, and IBM
– massively parallel supercomputer
– consists of 65,536 nodes (plus 12 front-end and 1,024 I/O nodes)
  – 2 PowerPC 440d processors (2.8 GFlops each) and 512 MB memory per node
  – 131,072 processors (367.00 TFlops peak performance) and 33.50 TB memory; 280.60 TFlops sustained performance (Linpack)
– nodes configured as 3D torus (32 × 32 × 64); global reduction tree for fast operations (global max / sum) in a few microseconds
– 1,024 Gbps link to global parallel file system
– further 806 TB disc space; operating system SuSE SLES 9
Supercomputers
Roadrunner – world’s #1 from 2008—09
– installed in 2008 at LANL, NM, USA
– installation costs about $120 million
– first “hybrid” supercomputer: dual-core Opteron + Cell Broadband Engine
– 129,600 cores (1,456.70 TFlops peak performance) and 98 TB memory; 1,144.00 TFlops sustained performance (Linpack)
– standard processing (file system I/O, e.g.) handled by Opteron, while mathematically and CPU-intensive tasks are handled by Cell
– 2.35 MW power consumption (≈ 437 MFlops per watt)
– primary usage: ensure safety and reliability of the nation’s nuclear weapons stockpile; real-time applications (cause & effect in capital markets, renderings of bone structures and tissues as patients are being examined, e.g.)
Supercomputers
HLRB II (world’s #6 for 04/2006)
– installed in 2006 at LRZ, Garching; installation costs 38 M€, monthly costs approx. 400,000 €
– upgrade in 2007 (finished); one of Germany’s 3 supercomputers
– SGI Altix 4700
– consists of 19 nodes (SGI NUMA link 2D torus)
  – 256 blades per node (ccNUMA link with partition fat tree)
  – Intel Itanium2 Montecito dual core (12.80 GFlops), 4 GB memory per core
– 9,728 cores (62.30 TFlops peak performance) and 39 TB memory; 56.50 TFlops sustained performance (Linpack)
– footprint 24 m × 12 m; total weight 103 metric tons
Supercomputers
SuperMUC (world’s #4 for 06/2012)
– installed in 2012 at LRZ, Garching
– IBM System x iDataPlex
– (still) one of Germany’s 3 supercomputers
– consists of 19 islands (Infiniband FDR10 pruned tree with 4:1 intra-island / inter-island ratio)
  – 18 thin islands with 512 nodes each (total 288 TB memory): Sandy Bridge-EP Xeon E5 (2 CPUs (8 cores each) per node)
  – 1 fat island with 205 nodes (total 52 TB memory): Westmere-EX Xeon E7 (4 CPUs (10 cores each) per node)
– 147,456 cores (3.185 PFlops peak performance – thin islands only); 2.897 PFlops sustained performance (Linpack)
– footprint 21 m × 26 m; warm water cooling
Overview
– motivation
– hardware excursion
– supercomputers
– classification of parallel computers
– quantitative performance evaluation
Classification of Parallel Computers
standard classification according to FLYNN
– global data and instruction streams as criterion
  – instruction stream: sequence of commands to be executed
  – data stream: sequence of data subject to instruction streams
– two-dimensional subdivision according to
  – amount of instructions per time a computer can execute
  – amount of data elements per time a computer can process
– hence, FLYNN distinguishes four classes of architectures
  – SISD: single instruction, single data
  – SIMD: single instruction, multiple data
  – MISD: multiple instruction, single data
  – MIMD: multiple instruction, multiple data
– drawback: very different computers may belong to the same class
Classification of Parallel Computers
standard classification according to FLYNN (cont’d)
SISD
– one processing unit that has access to one data memory and to one program memory
– classical monoprocessor following VON NEUMANN’s principle
[figure: processor connected to one data memory and one program memory]
Classification of Parallel Computers
standard classification according to FLYNN (cont’d)
SIMD
– several processing units, each with separate access to a (shared or distributed) data memory; one program memory
– synchronous execution of instructions
– example: array computer, vector computer
– advantages: easy programming model due to control flow with a strict synchronous-parallel execution of all instructions
– drawbacks: specialised hardware necessary, easily becomes outdated due to recent developments at the commodity market
[figure: several processors, each with its own data memory, driven by one program memory]
Classification of Parallel Computers
standard classification according to FLYNN (cont’d)
MISD
– several processing units that have access to one data memory; several program memories
– not very popular class (mainly for special applications such as Digital Signal Processing)
– operating on a single stream of data, forwarding results from one processing unit to the next
– example: systolic array (network of primitive processing elements that “pump” data)
[figure: processors sharing one data memory, each with its own program memory]
Classification of Parallel Computers
standard classification according to FLYNN (cont’d)
MIMD
– several processing units, each with separate access to a (shared or distributed) data memory; several program memories
– classification according to (physical) memory organisation
  – shared memory → shared (global) address space
  – distributed memory → distributed (local) address space
– example: multiprocessor systems, networks of computers
[figure: several processors, each with its own data memory and program memory]
Classification of Parallel Computers
processor coupling
– cooperation of processors / computers as well as their shared use of various resources require communication and synchronisation
– the following types of processor coupling can be distinguished
  – memory-coupled multiprocessor systems (MemMS)
  – message-coupled multiprocessor systems (MesMS)
[table:
                              global memory    distributed memory
  shared address space        MemMS, SMP       Mem-MesMS (hybrid)
  distributed address space   —                MesMS]
Classification of Parallel Computers
processor coupling (cont’d)
uniform memory access (UMA)
– each processor P has direct access via the network to each memory module M with same access times to all data
– standard programming model can be used (i.e. no explicit send / receive of messages necessary)
– communication and synchronisation via shared variables (inconsistencies (write conflicts, e.g.) have to be prevented in general by the programmer)
[figure: processors P connected via a network to memory modules M]
Classification of Parallel Computers
processor coupling (cont’d)
symmetric multiprocessor (SMP)
– only a small amount of processors, in most cases a central bus, one address space (UMA), but bad scalability
– cache coherence implemented in hardware (i.e. a read always provides a variable’s value from its last write)
– example: double or quad boards, SGI Challenge
[figure: processors P, each with a cache C, connected via a bus to one memory M]
Classification of Parallel Computers
processor coupling (cont’d)
non-uniform memory access (NUMA)
– memory modules physically distributed among processors
– shared address space, but access times depend on location of data (i.e. local addresses faster than remote addresses)
– differences in access times are visible in the program
– example: DSM / VSM, Cray T3E
[figure: processors P with local memories M, connected via a network]
Classification of Parallel Computers
processor coupling (cont’d)
cache-coherent non-uniform memory access (ccNUMA)
– caches for local and remote addresses; cache coherence implemented in hardware for the entire address space
– problem with scalability due to frequent cache actualisations
– example: SGI Origin 2000
[figure: processors P with caches C and local memories M, connected via a network]
Classification of Parallel Computers
processor coupling (cont’d)
cache-only memory access (COMA)
– each processor has only cache memory
– entirety of all cache memories = global shared memory
– cache coherence implemented in hardware
– example: Kendall Square Research KSR-1
[figure: processors P with caches C, connected via a network]
Classification of Parallel Computers
processor coupling (cont’d)
no remote memory access (NORMA)
– each processor has direct access to its local memory only
– access to remote memory only via explicit message exchange (due to distributed address space) possible
– synchronisation implicitly via the exchange of messages
– performance improvement between memory and I/O due to parallel data transfer (Direct Memory Access, e.g.) possible
– example: IBM SP2, ASCI Red / Blue / White
[figure: processors P with local memories M, connected via a network]
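Message-coupled (NORMA-style) programming is covered in part 5; as a preview, a minimal MPI sketch in C (illustrative only, not from the slides; run with e.g. mpirun -np 2) showing the explicit message exchange such architectures require:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;   /* no shared memory: data travels only via messages */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}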
Classification of Parallel Computers
difference between processes and threads
[figure:
 process model (NORMA): several programs (*.exe, *.out, e.g.), each with its own address space, communicating via messages;
 thread model (UMA, NUMA): one program (*.exe, *.out, e.g.) whose threads share a common address space]
Overview
– motivation
– hardware excursion
– supercomputers
– classification of parallel computers
– quantitative performance evaluation
Quantitative Performance Evaluation
execution time
– time T of a parallel program between start of the execution on one processor and end of all computations on the last processor
– during execution all processors are in one of the following states
  – compute – TCOMP: time spent for computations
  – communicate – TCOMM: time spent for send and receive operations
  – idle – TIDLE: time spent for waiting (sending / receiving messages)
– hence T = TCOMP + TCOMM + TIDLE
Quantitative Performance Evaluation
comparison multiprocessor / monoprocessor
– correlation of multi- and monoprocessor systems’ performance
– important: program that can be executed on both systems
– definitions
  – P(1): amount of unit operations of a program on the monoprocessor system
  – P(p): amount of unit operations of a program on the multiprocessor system with p processors
  – T(1): execution time of a program on the monoprocessor system (measured in steps or clock cycles)
  – T(p): execution time of a program on the multiprocessor system (measured in steps or clock cycles) with p processors
Quantitative Performance Evaluation
comparison multiprocessor / monoprocessor (cont’d)
simplifying preconditions
– T(1) = P(1): one operation to be executed in one step on the monoprocessor system
– T(p) ≤ P(p): more than one operation can be executed in one step (for p ≥ 2) on the multiprocessor system with p processors
Quantitative Performance Evaluation
comparison multiprocessor / monoprocessor (cont’d)
– speed-up S(p) indicates the improvement in processing speed
  S(p) = T(1) / T(p)   with 1 ≤ S(p) ≤ p
– efficiency E(p) indicates the relative improvement in processing speed; the improvement is normalised by the amount of processors p
  E(p) = S(p) / p   with 1/p ≤ E(p) ≤ 1
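In practice both quantities are computed from measured execution times; a minimal C sketch (illustrative, with made-up timings):

#include <stdio.h>

/* speed-up and efficiency from measured times T(1) and T(p) */
static void report(double t1, double tp, int p)
{
    double s = t1 / tp;    /* S(p) = T(1) / T(p) */
    double e = s / p;      /* E(p) = S(p) / p    */
    printf("p = %2d: S(p) = %5.2f, E(p) = %4.2f\n", p, s, e);
}

int main(void)
{
    report(120.0, 64.0, 2);    /* example timings, made up */
    report(120.0, 35.0, 4);
    report(120.0, 20.0, 8);
    return 0;
}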
Quantitative Performance Evaluation
comparison multiprocessor / monoprocessor (cont’d)
speed-up and efficiency can be seen in two different ways
– algorithm-independent: the best known sequential algorithm for the monoprocessor system is compared to the respective parallel algorithm for the multiprocessor system → absolute speed-up / absolute efficiency
– algorithm-dependent: the parallel algorithm is treated as a sequential one to measure the execution time on the monoprocessor system; “unfair” due to communication and synchronisation overhead → relative speed-up / relative efficiency
Quantitative Performance Evaluation
scalability
– objective: adding further processing elements to the system shall reduce the execution time without any program modifications, i.e. a linear performance increase with an efficiency close to 1
– important for the scalability is a sufficient problem size
  – one porter may carry one suitcase in a minute
  – 60 porters won’t do it in a second
  – but 60 porters may carry 60 suitcases in a minute
– in case of a fixed problem size and an increasing amount of processors, saturation will occur for a certain value of p; hence scalability is limited
– when scaling the amount of processors together with the problem size (so-called scaled problem analysis) this effect will not appear for well-scalable hard- and software systems
Quantitative Performance Evaluation
AMDAHL’s law
– the probably most important and most famous estimate for the speed-up (even if quite pessimistic)
– underlying model
  – each program has a sequential part s, 0 ≤ s ≤ 1, that can only be executed in a sequential way: synchronisation, data I/O, …
  – furthermore, each program consists of a parallelisable part 1−s that can be executed in parallel by several processes; finding the maximum value within a set of numbers, e.g.
– hence, the execution time for the parallel program executed on p processors can be written as
  T(p) = s∙T(1) + ((1−s)/p)∙T(1)
Quantitative Performance Evaluation
AMDAHL’s law (cont’d)
– the speed-up can thus be computed as
  S(p) = T(1) / T(p) = T(1) / (s∙T(1) + ((1−s)/p)∙T(1)) = 1 / (s + (1−s)/p)
– when increasing p we finally get AMDAHL’s law
  lim (p→∞) S(p) = lim (p→∞) 1 / (s + (1−s)/p) = 1/s
– speed-up is bounded: S(p) ≤ 1/s
– the sequential part can have a dramatic impact on the speed-up
– therefore, central effort of all (parallel) algorithms: keep s small
– many parallel programs have a small sequential part (s < 0.1)
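Plugging in the value used on the next slide (s = 0.1, my arithmetic): S(10) = 1/(0.1 + 0.9/10) ≈ 5.3 and S(100) = 1/(0.1 + 0.9/100) ≈ 9.2, and no number of processors pushes S(p) beyond 1/s = 10.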
Quantitative Performance Evaluation
AMDAHL’s law (cont’d)
example: s = 0.1
[graph: speed-up S(p) over the number of processes p = 0 … 100; the curve flattens out and approaches the bound 1/s = 10]
– independent from p, the speed-up is bounded by this limit
– where’s the error?
Quantitative Performance Evaluation
GUSTAFSON’s law
– addresses the shortcomings of AMDAHL’s law as it states that any sufficiently large problem can be efficiently parallelised
– instead of a fixed problem size it supposes a fixed time concept
– underlying model
  – execution time on the parallel machine is normalised to 1
  – this contains a non-parallelisable part σ, 0 ≤ σ ≤ 1
– hence, the execution time for the sequential program on the monoprocessor can be written as
  T(1) = σ + p∙(1−σ)
– the speed-up can thus be computed as
  S(p) = T(1) / T(p) = σ + p∙(1−σ) = p + σ∙(1−p)
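For comparison with the AMDAHL example (my arithmetic): with σ = 0.1 and p = 100, S(100) = 0.1 + 100 ∙ 0.9 = 90.1, instead of the ≈ 9.2 that AMDAHL’s law predicts for a fixed problem size.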
Quantitative Performance Evaluation
GUSTAFSON’s law (cont’d)
– difference to AMDAHL: the sequential part s(p) is not constant, but gets smaller with increasing p
  s(p) = σ / (σ + p∙(1−σ)) → 0 for p → ∞
– often more realistic, because more processors are used for a larger problem size, and here parallelisable parts typically increase (more computations, less declarations, …)
– speed-up is not bounded for increasing p
Quantitative Performance Evaluation
GUSTAFSON’s law (cont’d)
some more thoughts about speed-up
– theory tells: a superlinear speed-up does not exist
  – each parallel algorithm can be simulated on a monoprocessor system by emulating in a loop always the next step of a processor from the multiprocessor system
– but superlinear speed-up can be observed
  – when improving an inferior sequential algorithm
  – when a parallel program (that does not fit into the main memory of the monoprocessor system) completely runs in cache and main memory of the nodes of the multiprocessor system
Quantitative Performance Evaluation
communication—computation ratio (CCR)
– important quantity measuring the success of a parallelisation
– relation of pure communication time to pure computing time
– a small CCR is favourable; typically the CCR decreases with increasing problem size
– example
  – N×N matrix distributed among p processors (N/p rows each)
  – iterative method: in each step, each matrix element is replaced by the average of its eight neighbour values
  – hence, the two neighbouring rows are always necessary
  – computation time: 8∙N∙N/p
  – communication time: 2∙N
  – CCR: p/(4∙N) – what does this mean?
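The stated ratio follows directly from the two times (a quick check): CCR = 2∙N / (8∙N∙N/p) = p/(4∙N). It means that adding processors (larger p) worsens the ratio, while enlarging the problem (larger N) improves it, exactly the behaviour claimed above.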
Twelve ways…
…to fool the masses when giving performance results on parallel computers.
—David H. Bailey, NASA Ames Research Centre, 1991
1. Quote only 32-bit performance results, not 64-bit results.
2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs.
4. Scale up the problem size with the number of processors, but omit any mention of this fact.
5. Quote performance results projected to a full system.
6. Compare your results against scalar, unoptimised codes on Crays.
Twelve ways…
7. When direct run time comparisons are required, compare with an old code on an obsolete system.
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
9. Quote performance in terms of processor utilisation, parallel speed-ups or MFLOPS per dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
12. If all else fails, show pretty pictures and animated videos, and don’t talk about performance.