Parallel Computing Architectures
Moreno Marzolla
Dip. di Informatica—Scienza e Ingegneria (DISI)
Università di Bologna
http://www.moreno.marzolla.name/
Parallel Architectures 3
An Abstract Parallel Architecture
● How is parallelism handled?
● What is the exact physical location of the memories?
● What is the topology of the interconnect network?
[Figure: several processors and several memories connected through an interconnect network]
Parallel Architectures 4
Why are parallel architectures important?
● There is no "typical" parallel computer: different vendors use different architectures
● There is currently no "universal" programming paradigm that fits all architectures
– Parallel programs must be tailored to the underlying parallel architecture
– The architecture of a parallel computer limits the choice of the programming paradigm that can be used
Parallel Architectures 7
Von Neumann architecture
[Figure: the Von Neumann architecture, with a CPU containing the ALU, the registers R0 … Rn-1, PC, PSW and IR, and a control unit, connected to memory through a bus with data, address and control lines]
Parallel Architectures 8
The Fetch-Decode-Execute cycle
● The CPU performs an infinite loop
● Fetch
– Fetch the opcode of the next instruction from the memory address stored in the PC register, and put the opcode in the IR register
● Decode
– The content of the IR register is analyzed to identify the instruction to execute
● Execute
– The control unit activates the appropriate functional units of the CPU to perform the actions required by the instruction (e.g., read values from memory, execute arithmetic computations, and so on)
Parallel Architectures 9
How to limit the bottlenecks of the Von Neumann architecture
● Reduce the memory access latency
– Rely on CPU registers whenever possible
– Use caches
● Hide the memory access latency
– Multithreading and context switches during memory accesses
● Execute multiple instructions at the same time
– Pipelining
– Multiple issue
– Branch prediction
– Speculative execution
– SIMD extensions
CPU times compared to the real world
Operation                    Latency       Scaled (1 CPU cycle = 1 s)
1 CPU cycle                  0.3 ns        1 s
Level 1 cache access         0.9 ns        3 s
Main memory access           120 ns        6 min
Solid-state disk I/O         50-150 μs     2-6 days
Rotational disk I/O          1-10 ms       1-12 months
Internet: SF to NYC          40 ms         4 years
Internet: SF to UK           81 ms         8 years
Internet: SF to Australia    183 ms        19 years
Physical system reboot       1 min         6 millennia
Source: https://blog.codinghorror.com/the-infinite-space-between-words/
Parallel Architectures 12
Cache hierarchy
● Large memories are slow; fast memories are small
[Figure: the memory hierarchy, from the CPU through the L1, L2 and L3 caches down to DRAM, possibly across an interconnect bus]
Parallel Architectures 13
Cache hierarchy of the AMD Bulldozer architecture
Source: http://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29
Parallel Architectures 14
CUDA memory hierarchy
[Figure: each thread has its own registers and local memory; the threads of a block share the block's shared memory; all blocks access the global, constant and texture memories]
Parallel Architectures 15
How the cache works
● Cache memory is very fast
– Often located inside the processor chip
– Expensive, and very small compared to system memory
● The cache contains a copy of the content of some recently accessed (main) memory locations
– If the same memory location is accessed again, and its content is still in the cache, then the cache is accessed instead of the system RAM
Parallel Architectures 16
How the cache works
● If the CPU accesses a memory location whose content is not already in cache
– the content of that memory location and "some" adjacent locations are copied into the cache
– in doing so it might be necessary to evict some old data from the cache to make room for the new data
● The smallest unit of data that can be transferred to/from the cache is the cache line
– On Intel processors, usually 1 cache line = 64 B (see the sketch below for a way to query this at run time)
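On Linux/glibc systems the cache line size can also be queried programmatically; a minimal sketch (not part of the original slides; it assumes a system that exposes _SC_LEVEL1_DCACHE_LINESIZE through sysconf):

#include <stdio.h>
#include <unistd.h>

int main( void )
{
    /* glibc extension: size in bytes of an L1 data cache line */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line > 0) {
        printf("L1 data cache line: %ld bytes\n", line);
    } else {
        printf("cache line size not available on this system\n");
    }
    return 0;
}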
Parallel Architectures 23
Spatial and temporal locality
● Cache memory works well when applications exhibit spatial and/or temporal locality
● Spatial locality
– Accessing memory locations adjacent to recently accessed ones benefits from the cache
● Temporal locality
– Repeatedly accessing the same memory location(s) benefits from the cache
Parallel Architectures 24
Example: matrix-matrix product
● Given two square matrices p, q, compute r = p q
void matmul( double *p, double *q, double *r, int n )
{
    int i, j, k;
    for (i=0; i<n; i++) {
        for (j=0; j<n; j++) {
            r[i*n + j] = 0.0;
            for (k=0; k<n; k++) {
                r[i*n + j] += p[i*n + k] * q[k*n + j];
            }
        }
    }
}
Parallel Architectures 25
Matrix representation
● Matrices in C are stored in row-major order
– Elements of each row are contiguous in memory
– Adjacent rows are contiguous in memory (the index mapping is sketched after the figure)
[Figure: a 4×4 matrix p stored in row-major order; the rows starting at p[0][0], p[1][0], p[2][0] and p[3][0] occupy consecutive blocks of memory]
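The row-major layout is why the matmul code above addresses element (i, j) of an n×n matrix as p[i*n + j]; a tiny helper making the mapping explicit (illustrative only, not part of the original slides):

#include <assert.h>

/* element (i,j) of an n x n row-major matrix lives at offset i*n + j */
double get( const double *p, int n, int i, int j )
{
    assert(i >= 0 && i < n && j >= 0 && j < n);
    return p[i * n + j];
}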
Parallel Architectures 26
Row-wise vs column-wise access
Row-wise access is OK: the data is contiguous in memory, so the cache helps (spatial locality)
Column-wise access is NOT OK: the accessed elements are not contiguous in memory (strided access) so the cache does NOT help
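A small experiment makes the difference concrete. The sketch below (illustrative, not from the course material; N and the timing helper are arbitrary choices) sums the same matrix row-wise and then column-wise:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096

static double wtime( void )
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main( void )
{
    double *m = malloc(N * N * sizeof(*m));
    double sum = 0.0, t;
    int i, j;

    for (i = 0; i < N * N; i++) m[i] = 1.0;

    t = wtime();
    for (i = 0; i < N; i++)        /* row-wise: stride 1, cache-friendly */
        for (j = 0; j < N; j++)
            sum += m[i * N + j];
    printf("row-wise:    %f s\n", wtime() - t);

    t = wtime();
    for (j = 0; j < N; j++)        /* column-wise: stride N, cache-unfriendly */
        for (i = 0; i < N; i++)
            sum += m[i * N + j];
    printf("column-wise: %f s\n", wtime() - t);

    printf("sum = %f\n", sum);     /* keep the loops from being optimized away */
    free(m);
    return 0;
}

On typical hardware the column-wise loop is noticeably slower, even though it performs exactly the same number of additions.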
Parallel Architectures 27
Matrix-matrix product
● Given two square matrices p, q, compute r = p q
[Figure: computing r[i][j] scans row i of p (row-wise access, OK) and column j of q (column-wise access, NOT OK)]
Parallel Architectures 28
Optimizing the memory access pattern
[Figure: the 4×4 matrices p, q and r written out element by element]
Parallel Architectures 29
Optimizing the memory access pattern
[Figure: q is transposed, so that r = p q can be computed by scanning both p and the transposed matrix qT row-wise]
Parallel Architectures 30
But...
● Transposing the matrix q requires time. Do we gain some advantage in doing so?
● See cache.c
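A possible implementation along these lines (a sketch only; the actual cache.c used in the course may differ) first builds the transpose qT in a scratch buffer, then scans both p and qT row-wise:

#include <stdlib.h>

void matmul_transposed( double *p, double *q, double *r, int n )
{
    int i, j, k;
    double *qT = malloc( n * n * sizeof(*qT) );   /* scratch buffer for the transpose */
    for (i=0; i<n; i++) {
        for (j=0; j<n; j++) {
            qT[j*n + i] = q[i*n + j];             /* qT[j][i] = q[i][j] */
        }
    }
    for (i=0; i<n; i++) {
        for (j=0; j<n; j++) {
            double s = 0.0;
            for (k=0; k<n; k++) {
                s += p[i*n + k] * qT[j*n + k];    /* both operands accessed with stride 1 */
            }
            r[i*n + j] = s;
        }
    }
    free(qT);
}

The transpose costs O(n^2) extra work and memory, while the triple loop costs O(n^3), so for large n the better access pattern usually pays for the extra pass.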
Parallel Architectures 32
Instruction-Level Parallelism
● Uses multiple functional units to increase the performance of a processor
– Pipelining: the functional units are organized like an assembly line, and are used strictly in that order
– Multiple issue: the functional units can be used whenever required
● Static multiple issue: the order in which functional units are activated is decided at compile time (example: Intel IA-64)
● Dynamic multiple issue (superscalar): the order in which functional units are activated is decided at run time
Parallel Architectures 33
Instruction-Level Parallelism
IF   Instruction Fetch
ID   Instruction Decode
EX   Execute
MEM  Memory Access
WB   Write Back
[Figure: pipelining (instructions flow through the IF, ID, EX, MEM and WB stages) and multiple issue (an instruction fetch and decode unit issues in order to integer, floating-point and load/store units, which execute out of order, while a commit unit retires results in order)]
Parallel Architectures 34
Pipelining
[Figure: an instruction stream flowing through the pipeline; at each clock cycle a new instruction enters the IF stage while the previous ones advance, so that up to five instructions (Instr1 … Instr5) are in flight at the same time]
Parallel Architectures 35
Control Dependency
z = x + y;
if ( z > 0 ) {
    w = x;
} else {
    w = y;
}
● The instructions “w = x” and “w = y” have a control dependency on “z > 0”
● Control dependencies can limit the performance of pipelined architectures
z = x + y;
c = z > 0;
w = x*c + y*(1-c);
Parallel Architectures 36
In the real world...
● From the GCC documentation
http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
— Built-in Function: long __builtin_expect(long exp, long c)
You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.
The return value is the value of exp, which should be an integral expression. The semantics of the built-in are that it is expected that exp == c. For example:
if (__builtin_expect (x, 0)) foo ();
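In practice __builtin_expect is usually hidden behind macros; the following sketch (assuming GCC or Clang; the same idea as the Linux kernel's likely()/unlikely() macros, not taken from the slides) shows the typical pattern:

#include <stdio.h>
#include <stdlib.h>

/* wrap the builtin so call sites stay readable */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int main( void )
{
    char *buf = malloc(16);
    if (unlikely(buf == NULL)) {   /* hint: allocation failure is the rare path */
        return EXIT_FAILURE;
    }
    buf[0] = '\0';
    printf("allocation succeeded\n");
    free(buf);
    return EXIT_SUCCESS;
}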
Parallel Architectures 37
Branch Hint: Example
#include <stdlib.h>

int main( void )
{
    int A[1000000];
    size_t i;
    const size_t n = sizeof(A) / sizeof(A[0]);
    for ( i=0; __builtin_expect( i<n, 1 ); i++ ) {
        A[i] = i;
    }
    return 0;
}
Refrain from this kind of micro-optimization: ideally, this is stuff for compiler writers
Parallel Architectures 38
Hardware multithreading
● Allows the CPU to switch to another task when the current task is stalled
● Fine-grained multithreading
– A context switch has essentially zero cost
– The CPU can switch to another task even on stalls of short duration (e.g., waiting for memory operations)
– Requires a CPU with specific support, e.g., the Cray XMT
● Coarse-grained multithreading
– A context switch has a non-negligible cost, and is appropriate only for threads blocked on I/O operations or similar
– The CPU is less efficient in the presence of stalls of short duration
Source: https://www.slideshare.net/jasonriedy/sting-a-framework-for-analyzing-spaciotemporal-interaction-networks-and-graphs
Parallel Architectures 40
Hardware multithreading
● Simultaneous multithreading (SMT) is an implementation of fine-grained multithreading where different threads can use multiple functional units at the same time
● HyperThreading is Intel's implementation of SMT
– Each physical processor core is seen by the Operating System as two "logical" processors
– Each "logical" processor maintains a complete set of the architecture state:
● general-purpose registers
● control registers
● advanced programmable interrupt controller (APIC) registers
● some machine state registers
– Intel claims that HT provides a 15-30% speedup with respect to a similar, non-HT processor
Parallel Architectures 41
HyperThreading
● The pipeline stages are separated by two buffers (one for each executing thread)
● If one thread is stalled, the other one might go ahead and fill the pipeline slot, resulting in higher processor utilization
[Figure: the IF, ID, EX, MEM and WB pipeline stages, with a pair of queues (one per executing thread) between consecutive stages]
Hyper-Threading Technology Architecture and Microarchitecture http://www.cs.virginia.edu/~mc2zk/cs451/vol6iss1_art01.pdf
HyperThreading
● See the output of lscpu ("Thread(s) per core") or lstopo (hwloc Ubuntu/Debian package)
[Figure: lstopo output for a processor without HT and for a processor with HT]
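The number of logical processors can also be queried programmatically; a minimal sketch (not from the slides; it assumes a glibc/POSIX system that supports _SC_NPROCESSORS_ONLN):

#include <stdio.h>
#include <unistd.h>

int main( void )
{
    /* counts "logical" processors, so on an HT system it is typically
       twice the number of physical cores */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors online: %ld\n", n);
    return 0;
}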
Parallel Architectures 44
Flynn's Taxonomy
● The four classes, according to the number of instruction streams and data streams:
– SISD: Single Instruction Stream, Single Data Stream (the classical Von Neumann architecture)
– SIMD: Single Instruction Stream, Multiple Data Streams
– MISD: Multiple Instruction Streams, Single Data Stream
– MIMD: Multiple Instruction Streams, Multiple Data Streams
Parallel Architectures 45
SIMD
● SIMD instructions apply the same operation (e.g., sum, product, …) to multiple elements (typically 4 or 8, depending on the width of the SIMD registers and on the data types of the operands)
– This means that there must be 4/8/… independent ALUs
[Figure: four SIMD lanes executing LOAD A[i], LOAD B[i], C[i] = A[i] + B[i], STORE C[i] in lockstep for i = 0 … 3, with time flowing downward]
Parallel Architectures 46
SSE (Streaming SIMD Extensions)
● Extension to the x86 instruction set
● Provides new SIMD instructions operating on small arrays of integer or floating-point numbers
● Introduces 8 new 128-bit registers (XMM0-XMM7)
● SSE2 instructions can handle
– 2 64-bit doubles, or
– 2 64-bit integers, or
– 4 32-bit integers, or
– 8 16-bit integers, or
– 16 8-bit chars
[Figure: the 128-bit XMM registers, each holding e.g. four 32-bit lanes]
Parallel Architectures 47
SSE (Streaming SIMD Extensions)
[Figure: a packed SSE operation on four 32-bit lanes: X3 … X0 are combined element-wise with Y3 … Y0, producing X3 op Y3 … X0 op Y0]
Parallel Architectures 48
Example

__m128 a = _mm_set_ps( 1.0, 2.0, 3.0, 4.0 );
__m128 b = _mm_set_ps( 2.0, 4.0, 6.0, 8.0 );
__m128 ab = _mm_mul_ps(a, b);
[Figure: a = {1.0, 2.0, 3.0, 4.0}, b = {2.0, 4.0, 6.0, 8.0}, ab = {2.0, 8.0, 18.0, 32.0}; each pair of 32-bit lanes is multiplied element-wise]
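For reference, a self-contained version of the example above (a sketch assuming an x86 compiler with SSE support, e.g. gcc on x86-64 where SSE is enabled by default); note that _mm_set_ps lists the lanes from highest to lowest:

#include <stdio.h>
#include <xmmintrin.h>

int main( void )
{
    __m128 a  = _mm_set_ps( 1.0f, 2.0f, 3.0f, 4.0f );  /* 4.0 ends up in the lowest lane */
    __m128 b  = _mm_set_ps( 2.0f, 4.0f, 6.0f, 8.0f );
    __m128 ab = _mm_mul_ps( a, b );                     /* element-wise product of the 4 lanes */

    float out[4];
    _mm_storeu_ps( out, ab );                           /* unaligned store of the 4 results */
    /* prints the lowest lane first: 32 18 8 2 */
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}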
Parallel Architectures 49
GPU
Fermi GPU chip (source: http://www.legitreviews.com/article/1100/1/)
● Modern GPUs (Graphics Processing Units) have a large number of cores, and can be regarded as a form of SIMD architecture
Parallel Architectures 50
CPU vs GPU
[Figure: chip area usage: a CPU devotes much of its area to control logic and caches, with a few ALUs and a DRAM controller; a GPU devotes most of its area to ALUs, with little control logic and cache]
● The difference between CPUs and GPUs can be appreciated by looking at how the chip surface is used
Parallel Architectures 51
GPU core
● A single GPU core contains a fetch/decode unit shared among multiple ALUs
– If there are 8 ALUs, each instruction can operate on 8 values simultaneously
● Each GPU core maintains multiple execution contexts, and can switch between them at virtually zero cost
– Fine-grained parallelism
[Figure: a GPU core: one fetch/decode unit feeding eight ALUs, plus several execution contexts (Ctx)]
Parallel Architectures 53
MIMD
● In MIMD systems there are multiple execution units that can execute multiple sequences of instructions
– Multiple Instruction Streams
● Each execution unit generally operates on its own input data
– Multiple Data Streams
[Figure: several execution units, each running its own instruction stream (function calls, assignments, branches, …) on its own data, with time flowing downward]
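On a shared-memory MIMD machine this picture maps naturally onto threads; the sketch below (illustrative only, not from the slides; it assumes POSIX threads and compilation with -pthread) runs two different instruction streams on two different data sets:

#include <stdio.h>
#include <pthread.h>

void *sum_worker( void *arg )          /* first instruction stream */
{
    int *v = arg, s = 0, i;
    for (i = 0; i < 4; i++) s += v[i];
    printf("sum = %d\n", s);
    return NULL;
}

void *max_worker( void *arg )          /* second, different instruction stream */
{
    int *v = arg, m = v[0], i;
    for (i = 1; i < 4; i++) if (v[i] > m) m = v[i];
    printf("max = %d\n", m);
    return NULL;
}

int main( void )
{
    int a[4] = {1, 2, 3, 4}, b[4] = {7, 5, 6, 2};   /* multiple data streams */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_worker, a);
    pthread_create(&t2, NULL, max_worker, b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}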
Parallel Architectures 54
MIMD architectures
● Shared Memory
– A set of processors sharing a common memory space
– Each processor can access any memory location
● Distributed Memory
– A set of compute nodes connected through an interconnection network
● The simplest example: a cluster of PCs connected via Ethernet
– Nodes can share data through explicit communication
[Figure: shared memory: several CPUs connected through an interconnect to a single memory; distributed memory: several nodes, each with its own CPU and local memory, connected through an interconnect]
Parallel Architectures 55
Hybrid architectures
● Many HPC systems are based on hybrid architectures
– Each compute node is a shared-memory multiprocessor
– A large number of compute nodes is connected through an interconnect network
[Figure: four compute nodes connected through an interconnect network; each node contains several CPUs sharing a local memory, plus GPUs]
Parallel Architectures 57
Distributed memory system: IBM BlueGene/Q @ CINECA
Architecture: 10 BGQ Frames
Model: IBM-BG/Q
Processor Type: IBM PowerA2, 1.6 GHz
Computing Cores: 163840
Computing Nodes: 10240
RAM: 1 GByte / core
Internal Network: 5D Torus
Disk Space: 2 PByte of scratch space
Peak Performance: 2 PFlop/s
Parallel Architectures 62
SANDIA ASCI RED
Date: 1996
Peak performance: 1.8 Teraflops
Floor space: 150 m²
Power consumption: 800,000 Watt
Parallel Architectures 63
Sony PLAYSTATION 3
Date: 2006
Peak performance: >1.8 Teraflops
Floor space: 0.08 m²
Power consumption: <200 Watt

SANDIA ASCI RED
Date: 1996
Peak performance: 1.8 Teraflops
Floor space: 150 m²
Power consumption: 800,000 Watt
Parallel Architectures 65
Empirical rules
● When writing parallel applications (especially on distributed-memory architectures) keep in mind that:
– Computation is fast
– Communication is slow
– Input/output is incredibly slow
Parallel Architectures 66
Recap
Shared memory
● Advantages:
– Easier to program
– Useful for applications with irregular data access patterns (e.g., graph algorithms)
● Disadvantages:
– The programmer must take care of race conditions
– Limited memory bandwidth

Distributed memory
● Advantages:
– Highly scalable; provides very high computational power by adding more nodes
– Useful for applications with strong locality of reference and a high computation / communication ratio
● Disadvantages:
– Latency of the interconnect network
– Difficult to program