Introduction to Parallel Programming
ATHENS Course on Parallel Numerical Simulation
Munich, March 19−23, 2007
Dr. Ralf-Peter Mundani
Scientific Computing in Computer Science
Technische Universität München
Classification of Parallel Computers
Parallel Computers
parallel computers consist of a set of processing elements that can collaborate in a coordinated and (partially) simultaneous way in order to solve a joint task
possible appearances of such processing elements:
specialised units (steps of a vector pipeline or the vector pipelines of a vector computer’s vector unit, e. g.)
parallel features in modern monoprocessors (superscalar processor architecture, VLIW processors, multi-threading processor units, e. g.)
several uniform arithmetical units (processing elements of an array computer, e. g.)
processors or processing nodes of a multiprocessor computer
complete stand-alone computers, connected via a LAN (workstation or PC clusters as virtual parallel computers)
Parallel Computers
possible appearances of such processing elements (cont’d):
parallel computers or clusters connected via a remote network (so-called metacomputers)
target machines in the following: multi- and specialised processors as well as clusters (i. e. the so-called high-performance architectures or supercomputers)
Supercomputers
supercomputing or high-performance scientific computing as the most important application of the big number crunchers
national initiatives due to huge budget requirements
Accelerated Strategic Computing Initiative (ASCI) in the US
in the wake of the nuclear testing moratorium in 1992/93
decision: develop, build, and install a series of 5 supercomputers costing up to 100M dollars each in the US
start: ASCI Red (1997, Intel-based, SNL, the world’s first teraflop computer)
then: ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI White, …
meanwhile new High-End Computing memorandum in the US
Supercomputers
national initiatives due to huge budget requirements (cont’d)
federal Bundeshöchstleistungsrechner initiative in Germany
decision in the mid-nineties
3 federal supercomputing centres in Germany (München, Stuttgart, and Jülich), one new installation each year, the newest one to be among the top 10 of the world
overview and state of the art: Top500 list (every six months)
Top500 − Some Numbers
(a series of slides with charts and statistics from the Top500 list; figures not included)
The Earth Simulator − World’s #1 from 2002−2004
installed in 2002 in Yokohama, Japan
ES-building (approx. 50m × 65m × 17m)
based on NEC SX-6 architecture
developed by three governmental agencies
highly parallel vector supercomputer
consists of 640 nodes (plus 2 control & 128 data switching cabinets), each with
8 vector processors (8 GFlops each)
16 GByte shared memory
in total 5120 processors (40.96 TFlops peak performance) and 10 TByte memory; 35.86 TFlops sustained performance (Linpack)
nodes connected by 640×640 single-stage crossbar (83,200 cables with a total length of 2,400 km; 8 TByte/s total bandwidth)
further 700 TByte disc space and 1.6 PByte mass storage
BlueGene/L − World’s #1 since 2004
installed in 2005 at LLNL, CA, USA (beta-system in 2004 at IBM)
cooperation of DoE, LLNL, and IBM
massively parallel supercomputer
consists of 65,536 nodes (plus 12 front-end and 1,024 I/O nodes), each with
2 PowerPC 440d processors (2.8 GFlops each)
512 MByte memory
in total 131,072 processors (367 TFlops peak performance) and 33.5 TByte memory; 280.6 TFlops sustained performance (Linpack)
nodes configured as 3D torus (32 × 32 × 64); global reduction tree for fast operations (global max/sum) within a few microseconds
1024 Gigabit/s link to global parallel file system
further 806 TByte disc space; operating system SuSE SLES 9
HLRB II (SGI Altix 4700) − World’s #18 since 2006
installed in 2006 at LRZ, Garching
installation costs 38M Euro
monthly costs approx. 400,000 Euro
one of Germany’s 3 supercomputers
consists of 16 nodes (SGI NUMA link 2D torus)
256 blades (ccNUMA link with partition fat tree)
Intel Itanium2 1.6 GHz (6.4 GFlops; 4 FP-operations/clock)
4 GByte memory
in total 4096 processors (26.21 TFlops peak performance) and 17.5 TByte memory; 24.36 TFlops sustained performance (Linpack)
upgrade in 2007 from 4096 to 9728 processor cores (currently in progress)
Dual-Core Intel Itanium2 Montecito
62.3 TFlops peak performance
Standard Classification According to FLYNN
principle of the classification: computers as operators on two kinds of information streams
instruction stream: sequence of commands to be executed
data stream: sequence of data subject to the instruction stream
this results in a two-dimensional subdivision of the variety of computer architectures
number of instructions executed at a certain point of time
number of data elements processed at a certain point of time
hence, FLYNN distinguishes four classes of architectures:
single instruction − single data (SISD)
single instruction − multiple data (SIMD)
multiple instruction − single data (MISD)
multiple instruction − multiple data (MIMD)
drawback: very different computers may belong to the same class
Standard Classification According to FLYNN
SISD
the classical monoprocessor following VON NEUMANN’s principle
SIMD
array computers: consist of a large number (65,536 and more) of uniform processing elements arranged in a regular way, which − under central control − all apply the same instruction to some part of the data each, simultaneously
vector computers: consist of at least one vector pipeline (a functional unit designed as a pipeline for processing vectors of floating-point numbers); see the loop sketch after this classification
MISD
a pipeline of multiple independently executing functional units operating on a single stream of data, forwarding results from one functional unit to the next
Standard Classification According to FLYNN
MISD (cont’d)
not very popular class (mainly for special applications such as digital signal processing)
systolic array: a network of primitive processing elements that “pump” data (a hardware priority queue with constant-complexity operations can be built out of primitive three-number sorting elements, e. g.)
MIMD
multiprocessor systems, i. e. the classical parallel computer
networks of computers
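A minimal sketch (not from the slides) of the kind of loop SIMD hardware exploits − the same instruction, a multiply-add, is applied to every element, so a vector pipeline or a vectorising compiler can process the whole element stream in data-parallel fashion:

```c
/* saxpy: the classical SIMD-style loop (illustrative example) */
#include <stdio.h>

#define N 8    /* illustrative vector length */

int main(void)
{
    float a = 2.0f, x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }

    /* one instruction pattern for every data element: y = a*x + y;
     * a vector unit executes this as a single vector operation */
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    for (int i = 0; i < N; i++)
        printf("%g ", y[i]);
    printf("\n");
    return 0;
}
```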
Processor Coupling
the cooperation of processors or computers, as well as their shared use of various resources, requires communication and synchronisation
depending on the type of processor coupling, we distinguish
memory-coupled multiprocessor systems
message-coupled multiprocessor systems
memory coupling (strong coupling)
shared address space (physically and logically) for all processors, so-called shared memory
communication and synchronisation via shared variables
example: SMP (symmetric multiprocessors), where the access to global memory is identical for all processors
connection to memory realised via a central bus or via more complex structures (crossbar switch …)
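A minimal sketch (not from the slides, assuming a compiler with OpenMP support, e. g. gcc -fopenmp) of memory coupling in practice: all threads live in one address space and communicate through the shared variable sum:

```c
/* shared-memory parallelism: communication via a shared variable */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0;           /* shared variable in the common address space */
    const int n = 1000000;

    /* the reduction clause synchronises the threads' updates of sum */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += 1.0 / (double)i; /* each thread works on part of the data */

    printf("harmonic sum: %f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}
```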
Processor Coupling
message-coupling (weak or loose coupling)
physically distributed (local) memories and local address spaces, so-called distributed memory
communication via the exchange of messages through the network
synchronisation implicitly via communication instructions
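A minimal sketch (not from the slides, assuming an MPI installation; run with at least two processes, e. g. mpirun -np 2) of message coupling: the two processes have separate address spaces, so the value must be transferred explicitly, and the blocking receive synchronises them implicitly:

```c
/* distributed memory: communication by explicit message exchange */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;   /* exists only in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocking receive: also synchronises the two processes */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```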
A Hybrid Type: DSM/VSM
central issues:
scalability: How simple is it to add new nodes (processors) to the system?
programming model: How complicated is programming?
portability: How simple is porting/migration, i. e. the transfer from one processor to another one, if executability and functionality shall be preserved?
load distribution: How difficult is it to obtain a uniform distribution of the work load among the processors?
message-coupled systems are advantageous concerning scalability, memory-coupled systems are better w. r. t. the other aspects
idea: combine the advantages of both
DSM (distributed shared memory) or VSM (virtual shared memory):
physically distributed (local) memory
nevertheless one global shared address space
An Alternative Classification due to Processor Coupling
type of processor coupling allows for an alternative to FLYNN’s classification
uniform memory access (UMA)
access to shared memory is identical for all processors
same access times for all processors to all data
of course, a local cache is possible for each processor
classical representative: SMP
non-uniform memory access (NUMA)
memory modules are physically distributed among processors
nevertheless a shared global address space
access times depend on the location of the data (local or remote)
typical representative: DSM/VSM
no remote memory access (NORMA)
systems with distributed memory (physically and logically)
no direct access to another processor’s local world
Levels of Parallelism
Granularity
the decision which type of parallel architecture is best suited for a given parallel program strongly depends on the character and, especially, on the granularity of its parallelism
some remarks on granularity
qualitative meaning: the level on which work is done in parallel
we distinguish coarse-grain and fine-grain parallelism
quantitative meaning: ratio of computational effort and communication or synchronisation effort (roughly speaking the number of instructions between two necessary steps of communication)
starting point of the following considerations: a parallel program
Granularity
typically, five different levels are identified
program level
parallel processing of different programs
independent units without any shared data
no or only a small amount of communication
organised by the operating system
process level
a program is subdivided into different processes to be executed in parallel
each process: large number of sequential instructions, private address space
synchronisation is necessary
communication in most cases necessary (data exchange …)
support by the operating system via routines
Granularity
typically, five different levels are identified (cont’d)
block level
here, the units running in parallel are blocks of instructions or light-weight processes (threads)
smaller number of instructions, which share the address space with other blocks
communication via shared variables and synchronisation mechanisms (see the thread sketch after this list)
instruction level
parallel execution of machine instructions
optimising compilers can increase this potential by modifying the order of the commands
sub-instruction level
instructions are subdivided still further into units that can be executed in parallel or via overlapping
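A minimal sketch (not from the slides, using POSIX threads; link with -lpthread) of block-level parallelism: two light-weight processes share the address space and communicate through the shared array partial:

```c
/* block-level parallelism with threads in one shared address space */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double partial[2];          /* shared between all threads */

static void *block(void *arg)
{
    int id = *(int *)arg;
    double s = 0.0;
    /* each thread processes one half of the index range */
    for (int i = id * N / 2; i < (id + 1) * N / 2; i++)
        s += (double)i;
    partial[id] = s;               /* communication via shared variable */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int id[2] = {0, 1};
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, block, &id[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);  /* synchronisation: wait for both blocks */
    printf("sum = %.0f\n", partial[0] + partial[1]);
    return 0;
}
```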
Techniques of Parallel Work
the different levels of parallelism have methods of parallel work in the hardware as their counterparts
objective: best exploitation of the inherent potential
levels of parallel work:
computer coupling: useful for program level only, sometimes also for process level
processor coupling:
message-coupling for program and process level
memory-coupling for program, process, and block level
parallel work within the processor architecture: instruction pipelining, superscalar organisation, VLIW etc. for instruction level only, possibly also for sub-instruction level
SIMD techniques: concerning the sub-instruction level in vector and array computers
Quantitative Performance Evaluation
Performance Evaluation
standard quantities for monoprocessors
millions of instructions per second (MIPS)
millions of floating point operations per second (MFLOPS)
not sufficient for parallel computers
in which context was the measured performance achieved (interconnection structure, granularity of parallelism)?
how efficient is the parallelisation itself (obtaining a runtime reduction by a factor of 5 with 10 processors is definitely no great achievement)?
another issue
what is due to the parallel computer?
what is due to the parallel algorithm or program?
Notions of Time in the Execution of Instructions
not just the simple instruction time, but some more detailed notions of time instead
execution time T of a parallel program: time between start of the execution on the first participating processor and end of all computations on the last participating processor
computation time Tcomp of a parallel program: part of the execution time used for computations
communication time Tcomm of a parallel program: part of the execution time used for send and receive operations
idle time Tidle of a parallel program: part of the execution time spent waiting (i. e. neither computing nor sending nor receiving)
T = Tcomp + Tcomm + Tidle
Notions of Time in the Transmission of Data
further subdivision of communication
communication time Tmsg of a message: time needed to send a message from one processor to another one
setup time Ts: time for preparing and initialising the communication step
transfer time Tw per data word transmitted: depends on the bandwidth of the transmission channel
Tmsg = Ts + Tw ⋅ n (n data words)
of course, this relation holds only in case of a dedicated (conflict-free) connection
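A short worked example with assumed (illustrative) values: for Ts = 50 µs and Tw = 10 ns (i. e. a bandwidth of 100M words/s), a message of n = 100 words costs Tmsg = 50 µs + 1 µs = 51 µs, while n = 100,000 words cost 50 µs + 1 ms ≈ 1.05 ms; hence, the setup time dominates short messages, which favours few long messages over many short ones.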
Average Parallelism
total work during a parallel computation:
W(p) := l ⋅ Σ_{i=1}^{p} i ⋅ t_i
where
l: performance of one single processor
p: number of processors
t_i: time when exactly i processors are busy
average parallelism:
A(p) := (Σ_{i=1}^{p} i ⋅ t_i) / (Σ_{i=1}^{p} t_i) = W(p) / (l ⋅ Σ_{i=1}^{p} t_i)
for A(p), there exist several theoretical estimates (typically quite pessimistic), which were often used as arguments against massively parallel systems
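An illustrative example (numbers assumed, not from the slides): with p = 2, t_1 = 4 s, and t_2 = 6 s, we get A(2) = (1 ⋅ 4 + 2 ⋅ 6) / (4 + 6) = 1.6, i. e. on average 1.6 of the 2 processors are busy.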
Comparison Multiprocessor − Monoprocessor
(program-dependent) times
T(1): execution time on a monoprocessor
T(p): execution time on a p-processor
speed-up S(p)
S(p) = T(1) / T(p), bounds: 1 ≤ S(p) ≤ p
efficiency E(p)
E(p) = S(p) / p = T(1) / (p ⋅ T(p)), bounds: 1/p ≤ E(p) ≤ 1 (see the worked example below)
speed-up and efficiency come in two variants
algorithm-independent (absolute): compare the best known sequential algorithm with the given parallel one
algorithm-dependent (relative): compare the parallel algorithm with its sequential counterpart (or itself used sequentially)
which point of view is the more objective one?
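A short worked example (values assumed for illustration): if T(1) = 100 s and T(10) = 20 s, then S(10) = 100 / 20 = 5 and E(10) = 5 / 10 = 0.5 − exactly the factor-5-with-10-processors situation mentioned before, i. e. only half of the invested compute power is exploited.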
Scalability and Overhead
scalability
objective: adding more processors to the system shall reduce the execution time significantly without requiring code modifications
scalability requires a sufficient problem size (1 porter can carry a suitcase, but 60 porters cannot do it 60 times faster)
therefore often scaled problem analysis: with increasing number of processors, increase problem size, too
overhead
P(1): number of unit operations on a monoprocessor system
P(p): number of unit operations on a p-processor system
definition: R(p) = P(p) / P(1), bound: 1 ≤ R(p)
describes the additional number of operations for organisation, synchronisation, and communication
AMDAHL’s Law
probably the most important and most famous estimate for the speed-up
underlying model
each program consists of parts that can be parallelised and parts that can be executed only in a sequential way; sequential part s, 0 ≤ s ≤ 1
then, the following holds for execution time and speed-up:
T(p) = (1 − s) ⋅ T(1) / p + s ⋅ T(1)
S(p) = T(1) / T(p) = 1 / (s + (1 − s) / p)
thus, we get AMDAHL’s Law: S(p) ≤ 1 / s
meaning
the sequential part can have a dramatic impact on the speed-up
therefore the central effort of all (parallel) algorithmics: keep s small
this is possible: about 75% of all Linpack routines fulfill s < 0.1
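For illustration: with a sequential part s = 0.1, even infinitely many processors yield at most S = 1 / 0.1 = 10; with p = 10 processors, S(10) = 1 / (0.1 + 0.9 / 10) ≈ 5.3, i. e. barely above half of the ideal speed-up.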
Model of GUSTAFSON
alternative model for speed-up prediction or estimation
underlying model
normalise the execution time on the parallel machine to 1
there: non-parallelisable part σ
hence, execution time on the monoprocessor:
T(1) = σ + p ⋅ (1 − σ)
this results in a speed-up of
S(p) = σ + p ⋅ (1 − σ) = p + (1 − p) ⋅ σ
difference to AMDAHL
sequential part − w. r. t. execution time on one processor − is not constant, but gets smaller with increasing p
in GUSTAFSON’s model, the speed-up is not bounded for increasing p
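A minimal sketch (not from the slides) that evaluates both formulas side by side for an assumed sequential fraction of 0.1, making the bounded vs. unbounded behaviour visible:

```c
/* comparing the two speed-up models; note that the two fractions have
 * different meanings: Amdahl's s refers to the monoprocessor run T(1),
 * Gustafson's sigma to the normalised parallel run T(p); the value 0.1
 * is an illustrative assumption */
#include <stdio.h>

int main(void)
{
    const double s     = 0.1;   /* Amdahl: sequential fraction of T(1)    */
    const double sigma = 0.1;   /* Gustafson: sequential fraction of T(p) */

    for (int p = 1; p <= 1024; p *= 4) {
        double amdahl    = 1.0 / (s + (1.0 - s) / p);   /* bounded by 1/s */
        double gustafson = sigma + p * (1.0 - sigma);   /* unbounded in p */
        printf("p = %4d: Amdahl S = %7.3f, Gustafson S = %8.1f\n",
               p, amdahl, gustafson);
    }
    return 0;
}
```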
Communication−Computation Ratio (CCR)
important quantity measuring the success of a parallelisation
gives the relation of pure communication time and pure computation time
a small CCR is favourable
typically: CCR decreases with increasing problem size
example
consider a full N × N matrix
consider the following iterative method: in each step, each element is replaced by the average of its eight neighbours
for each row’s update, we need the two neighbouring rows
p processors, decompose the matrix into p blocks of N/p rows each
computing time: 8N ⋅ N / p
communication time: 2(p − 1) ⋅ N
hence, the CCR is (p² − p) / 4N − what does this mean? (see the sketch below)
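A minimal sketch (not from the slides, assuming an MPI installation) of the row-block decomposition just described: each rank owns N/p rows plus two halo rows, exchanges one boundary row with each neighbour per iteration (the communication term), and then performs the 8-flop averaging per element (the computation term); N = 1024, 100 iterations, and p dividing N are illustrative assumptions:

```c
/* row-block halo exchange for the 8-neighbour averaging example */
#include <mpi.h>
#include <stdlib.h>

#define N 1024                  /* matrix dimension (assumed) */

int main(int argc, char **argv)
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int rows = N / p;           /* assume p divides N */
    /* local block plus one halo row above and one below */
    double *a    = calloc((size_t)(rows + 2) * N, sizeof(double));
    double *anew = calloc((size_t)(rows + 2) * N, sizeof(double));
    int up   = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int it = 0; it < 100; it++) {
        /* communication: exchange boundary rows with both neighbours */
        MPI_Sendrecv(&a[1 * N], N, MPI_DOUBLE, up, 0,
                     &a[(rows + 1) * N], N, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&a[rows * N], N, MPI_DOUBLE, down, 1,
                     &a[0 * N], N, MPI_DOUBLE, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* computation: average of the eight neighbours for each interior
         * element -- 8 operations per element, 8*N*N/p per rank */
        for (int i = 1; i <= rows; i++)
            for (int j = 1; j < N - 1; j++)
                anew[i*N + j] = (a[(i-1)*N + j-1] + a[(i-1)*N + j] +
                                 a[(i-1)*N + j+1] + a[i*N + j-1] +
                                 a[i*N + j+1]     + a[(i+1)*N + j-1] +
                                 a[(i+1)*N + j]   + a[(i+1)*N + j+1]) / 8.0;
        double *tmp = a; a = anew; anew = tmp;   /* swap buffers */
    }

    free(a); free(anew);
    MPI_Finalize();
    return 0;
}
```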
http://www5.in.tum.de/~mundani/