Introduction to Parallel Programming
ATHENS Course on Parallel Numerical Simulation
Munich, March 19−23, 2007
Dr. Ralf-Peter Mundani
Scientific Computing in Computer Science
Technische Universität München
Classification of Parallel Computers
Parallel Computers
parallel computers consist of a set of processing elements that can collaborate in a coordinated and (partially) simultaneous way in order to solve a joint task
possible appearances of such processing elements:
specialised units (steps of a vector pipeline or the vector pipelines of a vector computer’s vector unit, e. g.)
parallel features in modern monoprocessors (superscalar processor architecture, VLIW processors, multi-threading processor units, e. g.)
several uniform arithmetical units (processing elements of an array computer, e. g.)
processors or processing nodes of a multiprocessor computer
complete stand-alone computers, connected via a LAN (workstation or PC clusters as virtual parallel computers)
Parallel Computers
possible appearances of such processing elements (cont’d):
parallel computers or clusters connected via a remote network (so-called metacomputers)
target machines in the following: multi- and specialised processors as well as clusters (i. e. the so-called high-performance architectures or supercomputers)
Supercomputers
supercomputing or high-performance scientific computing as the most important application of the big number crunchers
national initiatives due to huge budget requirements
Accelerated Strategic Computing Initiative (ASCI) in the US
in the wake of the nuclear testing moratorium in 1992/93
decision: develop, build, and install a series of 5 supercomputers costing up to 100M dollars each in the US
start: ASCI Red (1997, Intel-based, SNL, the world’s first teraflop computer)
then: ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI White, …
meanwhile new High-End Computing memorandum in the US
Supercomputers
national initiatives due to huge budget requirements (cont’d)
federal Bundeshöchstleistungsrechner initiative in Germany
decision in the mid-nineties
3 federal supercomputing centres in Germany (München, Stuttgart, and Jülich), one new installation each year, the newest one to be among the top 10 of the world
overview and state of the art: Top500 list (every six months)
Top500 − Some Numbers
(a series of slides with charts and statistics from the Top500 list; figures not included)
The Earth Simulator − World’s #1 from 2002−2004
installed in 2002 in Yokohama, Japan
ES-building (approx. 50m × 65m × 17m)
based on NEC SX-6 architecture
developed by three governmental agencies
highly parallel vector supercomputer
consists of 640 nodes (plus 2 control & 128 data switching cabinets), each with
8 vector processors (8 GFlops each)
16 GByte shared memory
in total 5120 processors (40.96 TFlops peak performance) and 10 TByte memory; 35.86 TFlops sustained performance (Linpack)
nodes connected by 640×640 single-stage crossbar (83,200 cables with a total length of 2,400 km; 8 TByte/s total bandwidth)
further 700 TByte disc space and 1.6 PByte mass storage
BlueGene/L − World’s #1 since 2004
installed in 2005 at LLNL, CA, USA (beta-system in 2004 at IBM)
cooperation of DoE, LLNL, and IBM
massively parallel supercomputer
consists of 65,536 nodes (plus 12 front-end and 1,024 I/O nodes), each with
2 PowerPC 440d processors (2.8 GFlops each)
512 MByte memory
in total 131,072 processors (367 TFlops peak performance) and 33.5 TByte memory; 280.6 TFlops sustained performance (Linpack)
nodes configured as 3D torus (32 × 32 × 64); global reduction tree for fast operations (global max/sum) within a few microseconds
1024 Gigabit/s link to global parallel file system
further 806 TByte disc space; operating system SuSE SLES 9
HLRB II (SGI Altix 4700) − World’s #18 since 2006
installed in 2006 at LRZ, Garching
installation costs 38M Euro
monthly costs approx. 400,000 Euro
one of Germany’s 3 supercomputers
consists of 16 nodes (SGI NUMA link 2D torus)
256 blades (ccNUMA link with partition fat tree)
Intel Itanium2 1.6 GHz (6.4 GFlops; 4 FP-operations/clock)
4 GByte memory
in total 4096 processors (26.21 TFlops peak performance) and 17.5 TByte memory; 24.36 TFlops sustained performance (Linpack)
upgrade in 2007 from 4096 to 9728 processor cores (currently in progress)
Dual-Core Intel Itanium2 Montecito
62.3 TFlops peak performance
Standard Classification According to FLYNN
principle of the classification: computers as operators on two kinds of information streams
instruction stream: sequence of commands to be executed
data stream: sequence of data subject to the instruction stream
this results in a two-dimensional subdivision of the variety of computer architectures
number of instructions executed at a certain point of time
number of data elements processed at a certain point of time
hence, FLYNN distinguishes four classes of architectures:
single instruction − single data (SISD)
single instruction − multiple data (SIMD)
multiple instruction − single data (MISD)
multiple instruction − multiple data (MIMD)
drawback: very different computers may belong to the same class
Standard Classification According to FLYNN
SISD
the classical monoprocessor following VON NEUMANN’s principle
SIMD
array computers: consist of a large number (65,536 and more) of uniform processing elements arranged in a regular way, which − under central control − all apply the same instruction to some part of the data each, simultaneously
vector computers: consist of at least one vector pipeline (a functional unit designed as a pipeline for processing vectors of floating-point numbers); see the loop sketch after this classification
MISD
a pipeline of multiple independently executing functional units operating on a single stream of data, forwarding results from one functional unit to the next
Standard Classification According to FLYNN
MISD (cont’d)
not very popular class (mainly for special applications such as digital signal processing)
systolic array: a network of primitive processing elements that “pump” data (a hardware priority queue with constant-complexity operations can be built out of primitive three-number sorting elements, e. g.)
MIMD
multiprocessor systems, i. e. the classical parallel computer
networks of computers
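A minimal sketch (not from the slides) of the kind of loop SIMD hardware exploits − the same instruction, a multiply-add, is applied to every element, so a vector pipeline or a vectorising compiler can process the whole element stream in data-parallel fashion:

```c
/* saxpy: the classical SIMD-style loop (illustrative example) */
#include <stdio.h>

#define N 8    /* illustrative vector length */

int main(void)
{
    float a = 2.0f, x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }

    /* one instruction pattern for every data element: y = a*x + y;
     * a vector unit executes this as a single vector operation */
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    for (int i = 0; i < N; i++)
        printf("%g ", y[i]);
    printf("\n");
    return 0;
}
```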
Processor Coupling
the cooperation of processors or computers, as well as their shared use of various resources, requires communication and synchronisation
depending on the type of processor coupling, we distinguish
memory-coupled multiprocessor systems
message-coupled multiprocessor systems
memory coupling (strong coupling)
shared address space (physically and logically) for all processors, so-called shared memory
communication and synchronisation via shared variables
example: SMP (symmetric multiprocessors), where the access to global memory is identical for all processors
connection to memory realised via a central bus or via more complex structures (crossbar switch …)
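A minimal sketch (not from the slides, assuming a compiler with OpenMP support, e. g. gcc -fopenmp) of memory coupling in practice: all threads live in one address space and communicate through the shared variable sum:

```c
/* shared-memory parallelism: communication via a shared variable */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0;           /* shared variable in the common address space */
    const int n = 1000000;

    /* the reduction clause synchronises the threads' updates of sum */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += 1.0 / (double)i; /* each thread works on part of the data */

    printf("harmonic sum: %f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}
```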
Processor Coupling
message-coupling (weak or loose coupling)
physically distributed (local) memories and local address spaces, so-called distributed memory
communication via the exchange of messages through the network
synchronisation implicitly via communication instructions
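A minimal sketch (not from the slides, assuming an MPI installation; run with at least two processes, e. g. mpirun -np 2) of message coupling: the two processes have separate address spaces, so the value must be transferred explicitly, and the blocking receive synchronises them implicitly:

```c
/* distributed memory: communication by explicit message exchange */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;   /* exists only in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocking receive: also synchronises the two processes */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```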
A Hybrid Type: DSM/VSM
central issues:
scalability: How simple is it to add new nodes (processors) to the system?
programming model: How complicated is programming?
portability: How simple is porting/migration, i. e. the transfer from one processor to another one, if executability and functionality shall be preserved?
load distribution: How difficult is it to obtain a uniform distribution of the work load among the processors?
message-coupled systems are advantageous concerning scalability, memory-coupled systems are better w. r. t. the other aspects
idea: combine the advantages of both
DSM (distributed shared memory) or VSM (virtual shared memory):
physically distributed (local) memory
nevertheless one global shared address space
An Alternative Classification due to Processor Coupling
type of processor coupling allows for an alternative to FLYNN’s classification
uniform memory access (UMA)
access to shared memory is identical for all processors
same access times for all processors to all data
of course, a local cache is possible for each processor
classical representative: SMP
non-uniform memory access (NUMA)
memory modules are physically distributed among processors
nevertheless a shared global address space
access times depend on the location of the data (local or remote)
typical representative: DSM/VSM
no remote memory access (NORMA)
systems with distributed memory (physically and logically)
no direct access to another processor’s local world
Levels of Parallelism
Granularity
the decision which type of parallel architecture is best suited for a given parallel program strongly depends on the character and, especially, on the granularity of its parallelism
some remarks on granularity
qualitative meaning: the level on which work is done in parallel
we distinguish coarse-grain and fine-grain parallelism
quantitative meaning: ratio of computational effort and communication or synchronisation effort (roughly speaking the number of instructions between two necessary steps of communication)
starting point of the following considerations: a parallel program
Granularity
typically, five different levels are identified
program level
parallel processing of different programs
independent units without any shared data
no or only a small amount of communication
organised by the operating system
process level
a program is subdivided into different processes to be executed in parallel
each process: large number of sequential instructions, private address space
synchronisation is necessary
communication in most cases necessary (data exchange …)
support by the operating system via routines
Granularity
typically, five different levels are identified (cont’d)
block level
here, the units running in parallel are blocks of instructions or light-weight processes (threads)
smaller number of instructions, which share the address space with other blocks
communication via shared variables and synchronisation mechanisms (see the thread sketch after this list)
instruction level
parallel execution of machine instructions
optimising compilers can increase this potential by modifying the order of the commands
sub-instruction level
instructions are subdivided still further into units that can be executed in parallel or via overlapping
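A minimal sketch (not from the slides, using POSIX threads; link with -lpthread) of block-level parallelism: two light-weight processes share the address space and communicate through the shared array partial:

```c
/* block-level parallelism with threads in one shared address space */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double partial[2];          /* shared between all threads */

static void *block(void *arg)
{
    int id = *(int *)arg;
    double s = 0.0;
    /* each thread processes one half of the index range */
    for (int i = id * N / 2; i < (id + 1) * N / 2; i++)
        s += (double)i;
    partial[id] = s;               /* communication via shared variable */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int id[2] = {0, 1};
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, block, &id[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);  /* synchronisation: wait for both blocks */
    printf("sum = %.0f\n", partial[0] + partial[1]);
    return 0;
}
```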
Techniques of Parallel Work
the different levels of parallelism have methods of parallel work in the hardware as their counterparts
objective: best exploitation of the inherent potential
levels of parallel work:
computer coupling: useful for program level only, sometimes also for process level
processor coupling:
message-coupling for program and process level
memory-coupling for program, process, and block level
parallel work within the processor architecture: instruction pipelining, superscalar organisation, VLIW etc. for instruction level only, possibly also for sub-instruction level
SIMD techniques: concerning the sub-instruction level in vector and array computers
Quantitative Performance Evaluation
Performance Evaluation
standard quantities for monoprocessors
millions of instructions per second (MIPS)
millions of floating point operations per second (MFLOPS)
not sufficient for parallel computers
in which context was the measured performance achieved (interconnection structure, granularity of parallelism)?
how efficient is the parallelisation itself (obtaining a runtime reduction by a factor of 5 with 10 processors is definitely no great achievement)?
another issue
what is due to the parallel computer?
what is due to the parallel algorithm or program?
Notions of Time in the Execution of Instructions
not just the simple instruction time, but some more detailed notions of time instead
execution time T of a parallel program: time between start of the execution on the first participating processor and end of all computations on the last participating processor
computation time Tcomp of a parallel program: part of the execution time used for computations
communication time Tcomm of a parallel program: part of the execution time used for send and receive operations
idle time Tidle of a parallel program: part of the execution time spent waiting (i. e. neither computing nor sending nor receiving)
T = Tcomp + Tcomm + Tidle
Notions of Time in the Transmission of Data
further subdivision of communication
communication time Tmsg of a message: time needed to send a message from one processor to another one
setup time Ts: time for preparing and initialising the communication step
transfer time Tw per data word transmitted: depends on the bandwidth of the transmission channel
Tmsg = Ts + Tw ⋅ n (n data words)
of course, this relation holds only in case of a dedicated (conflict-free) connection
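A short worked example with assumed (illustrative) values: for Ts = 50 µs and Tw = 10 ns (i. e. a bandwidth of 100M words/s), a message of n = 100 words costs Tmsg = 50 µs + 1 µs = 51 µs, while n = 100,000 words cost 50 µs + 1 ms ≈ 1.05 ms; hence, the setup time dominates short messages, which favours few long messages over many short ones.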
Average Parallelism
total work during a parallel computation:
W(p) := l ⋅ Σ_{i=1}^{p} i ⋅ t_i
where
l: performance of one single processor
p: number of processors
t_i: time when exactly i processors are busy
average parallelism:
A(p) := (Σ_{i=1}^{p} i ⋅ t_i) / (Σ_{i=1}^{p} t_i) = W(p) / (l ⋅ Σ_{i=1}^{p} t_i)
for A(p), there exist several theoretical estimates (typically quite pessimistic), which were often used as arguments against massively parallel systems
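An illustrative example (numbers assumed, not from the slides): with p = 2, t_1 = 4 s, and t_2 = 6 s, we get A(2) = (1 ⋅ 4 + 2 ⋅ 6) / (4 + 6) = 1.6, i. e. on average 1.6 of the 2 processors are busy.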
Comparison Multiprocessor − Monoprocessor
(program-dependent) times
T(1): execution time on a monoprocessor
T(p): execution time on a p-processor
speed-up S(p)
S(p) = T(1) / T(p), bounds: 1 ≤ S(p) ≤ p
efficiency E(p)
E(p) = S(p) / p = T(1) / (p ⋅ T(p)), bounds: 1/p ≤ E(p) ≤ 1 (see the worked example below)
speed-up and efficiency come in two variants
algorithm-independent (absolute): compare the best known sequential algorithm with the given parallel one
algorithm-dependent (relative): compare the parallel algorithm with its sequential counterpart (or itself used sequentially)
which point of view is the more objective one?
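A short worked example (values assumed for illustration): if T(1) = 100 s and T(10) = 20 s, then S(10) = 100 / 20 = 5 and E(10) = 5 / 10 = 0.5 − exactly the factor-5-with-10-processors situation mentioned before, i. e. only half of the invested compute power is exploited.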
Scalability and Overhead
scalability
objective: adding more processors to the system shall reduce the execution time significantly without requiring code modifications
scalability requires a sufficient problem size (1 porter can carry a suitcase, but 60 porters cannot do it 60 times faster)
therefore often scaled problem analysis: with increasing number of processors, increase problem size, too
overhead
P(1): number of unit operations on a monoprocessor system
P(p): number of unit operations on a p-processor system
definition: R(p) = P(p) / P(1), bound: 1 ≤ R(p)
describes the additional number of operations for organisation, synchronisation, and communication
AMDAHL’s Law
probably the most important and most famous estimate for the speed-up
underlying model
each program consists of parts that can be parallelised and parts that can be executed only in a sequential way; sequential part s, 0 ≤ s ≤ 1
then, the following holds for execution time and speed-up:
T(p) = (1 − s) ⋅ T(1) / p + s ⋅ T(1)
S(p) = T(1) / T(p) = 1 / (s + (1 − s) / p)
thus, we get AMDAHL’s Law: S(p) ≤ 1 / s
meaning
the sequential part can have a dramatic impact on the speed-up
therefore the central effort of all (parallel) algorithmics: keep s small
this is possible: about 75% of all Linpack routines fulfill s < 0.1
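For illustration: with a sequential part s = 0.1, even infinitely many processors yield at most S = 1 / 0.1 = 10; with p = 10 processors, S(10) = 1 / (0.1 + 0.9 / 10) ≈ 5.3, i. e. barely above half of the ideal speed-up.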
Model of GUSTAFSON
alternative model for speed-up prediction or estimation
underlying model
normalise the execution time on the parallel machine to 1
there: non-parallelisable part σ
hence, execution time on the monoprocessor:
T(1) = σ + p ⋅ (1 − σ)
this results in a speed-up of
S(p) = σ + p ⋅ (1 − σ) = p + (1 − p) ⋅ σ
difference to AMDAHL
sequential part − w. r. t. execution time on one processor − is not constant, but gets smaller with increasing p
in GUSTAFSON’s model, the speed-up is not bounded for increasing p
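A minimal sketch (not from the slides) that evaluates both formulas side by side for an assumed sequential fraction of 0.1, making the bounded vs. unbounded behaviour visible:

```c
/* comparing the two speed-up models; note that the two fractions have
 * different meanings: Amdahl's s refers to the monoprocessor run T(1),
 * Gustafson's sigma to the normalised parallel run T(p); the value 0.1
 * is an illustrative assumption */
#include <stdio.h>

int main(void)
{
    const double s     = 0.1;   /* Amdahl: sequential fraction of T(1)    */
    const double sigma = 0.1;   /* Gustafson: sequential fraction of T(p) */

    for (int p = 1; p <= 1024; p *= 4) {
        double amdahl    = 1.0 / (s + (1.0 - s) / p);   /* bounded by 1/s */
        double gustafson = sigma + p * (1.0 - sigma);   /* unbounded in p */
        printf("p = %4d: Amdahl S = %7.3f, Gustafson S = %8.1f\n",
               p, amdahl, gustafson);
    }
    return 0;
}
```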
Communication−Computation Ratio (CCR)
important quantity measuring the success of a parallelisation
gives the relation of pure communication time and pure computation time
a small CCR is favourable
typically: CCR decreases with increasing problem size
example
consider a full N × N matrix
consider the following iterative method: in each step, each element is replaced by the average of its eight neighbours
for each row’s update, we need the two neighbouring rows
p processors, decompose the matrix into p blocks of N/p rows each
computing time: 8N ⋅ N / p
communication time: 2(p − 1) ⋅ N
hence, the CCR is (p² − p) / 4N − what does this mean? (see the sketch below)
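A minimal sketch (not from the slides, assuming an MPI installation) of the row-block decomposition just described: each rank owns N/p rows plus two halo rows, exchanges one boundary row with each neighbour per iteration (the communication term), and then performs the 8-flop averaging per element (the computation term); N = 1024, 100 iterations, and p dividing N are illustrative assumptions:

```c
/* row-block halo exchange for the 8-neighbour averaging example */
#include <mpi.h>
#include <stdlib.h>

#define N 1024                  /* matrix dimension (assumed) */

int main(int argc, char **argv)
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int rows = N / p;           /* assume p divides N */
    /* local block plus one halo row above and one below */
    double *a    = calloc((size_t)(rows + 2) * N, sizeof(double));
    double *anew = calloc((size_t)(rows + 2) * N, sizeof(double));
    int up   = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int it = 0; it < 100; it++) {
        /* communication: exchange boundary rows with both neighbours */
        MPI_Sendrecv(&a[1 * N], N, MPI_DOUBLE, up, 0,
                     &a[(rows + 1) * N], N, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&a[rows * N], N, MPI_DOUBLE, down, 1,
                     &a[0 * N], N, MPI_DOUBLE, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* computation: average of the eight neighbours for each interior
         * element -- 8 operations per element, 8*N*N/p per rank */
        for (int i = 1; i <= rows; i++)
            for (int j = 1; j < N - 1; j++)
                anew[i*N + j] = (a[(i-1)*N + j-1] + a[(i-1)*N + j] +
                                 a[(i-1)*N + j+1] + a[i*N + j-1] +
                                 a[i*N + j+1]     + a[(i+1)*N + j-1] +
                                 a[(i+1)*N + j]   + a[(i+1)*N + j+1]) / 8.0;
        double *tmp = a; a = anew; anew = tmp;   /* swap buffers */
    }

    free(a); free(anew);
    MPI_Finalize();
    return 0;
}
```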
http://www5.in.tum.de/~mundani/