View
217
Download
1
Category
Tags:
Preview:
Citation preview
-1- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Parallel Computer Architectures
2nd weekReferencesFlynn’s Taxonomy Classification of Parallel Computers
Based on ArchitecturesTrend in Parallel Computer
Architectures
-2- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
REFERENCES
Scalable Parallel Computing– Kai Hwang and Zhiwei Xu, McGraw-Hill
Chapter.1Parallel Computing: Theory and Practice
– Michael J. Quinn, McGraw-HillChapter 3
Parallel Processing Course– Yang-Suk Kee(yskee@iris.snu.ac.kr)
School of EECS, Seoul National University
-3- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
FLYNN’S TAXONOMY
A way to classify parallel computers (Flynn,1972)
Based on notions of instruction and data streams– SISD (Single Instruction stream over a Single Data stream )– SIMD (Single Instruction stream over Multiple Data streams )– MISD (Multiple Instruction streams over a Single Data
stream)– MIMD (Multiple Instruction streams over Multiple Data
stream) Popularity
– MIMD > SIMD > MISD
-4- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
SISD (Single Instruction Stream Over A Single Data
Stream )
CU PU MU
IS
IS DSI/O
IS : Instruction Stream DS : Data StreamCU : Control Unit PU : Processing UnitMU : Memory Unit
SISD – Conventional sequential machines
-5- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
SIMD (Single Instruction Stream Over Multiple Data
Streams )SIMD
– Vector computers– Special purpose computations
CU
PE1 LM1
PEn LMn
DS
DS
DS
DS
ISIS
Program loaded from host
Data sets loaded from host
SIMD architecture with distributed memory
PE : Processing Element LM : Local Memory
-6- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
MISD – Processor arrays, systolic arrays– Special purpose computations
Is it different from pipeline computer?
MISD (Multiple Instruction Streams Over A Single Data
Streams)
Memory(Program,
Data) PU1 PU2 PUn
CU1 CU2 CUn
DS DS DSIS IS
IS IS
IS
DSI/O
MISD architecture (the systolic array)
-7- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
MIMD (Multiple Instruction Streams Over Multiple Data
Stream)MIMD
– General purpose parallel computers
CU1 PU1
SharedMemory
IS
IS DSI/O
CUn PUnIS DSI/O
IS
MIMD architecture with shared memory
-8- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
CLASSIFICATION BASED ON ARCHITECTURES
Pipelined ComputersDataflow ArchitecturesData Parallel SystemsMultiprocessorsMulticomputers
-9- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Instructions are divided into a number of steps (segments, stages)
At the same time, several instructions can be loaded in the machine and be executed in different steps
PIPELINED COMPUTERS
Instruction i IF ID EX WB
IF ID EX WB
IF ID EX WB
IF ID EX WB
IF ID EX WB
Instruction i+1
Instruction i+2
Instruction i+3
Instruction i+4
Instruction # 1 2 3 4 5 6 7 8
Cycles
-10- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
DATAFLOW ARCHITECTURES
Data-driven model– A program is represented as a directed acyclic graph in
which a node represents an instruction and an edge represents the data dependency relationship between the connected nodes
– Firing rule: A node can be scheduled for execution if and only if its input data become valid for consumption
Dataflow languages– Id, SISAL, Silage, ...– Single assignment, applicative(functional) language, no
side effect– Explicit parallelism
-11- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
DATAFLOW GRAPH
z = (a + b) * c+
*
a
b
c
z
The dataflow representation of an arithmetic expression
-12- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
DATA-FLOW COMPUTER
Execution of instructions is driven by data availability
Basic components– Data are directly held inside instructions– Data availability check unit– Token matching unit– Chain reaction of asynchronous instruction
executionsWhat is the difference between this and
normal (control flow) computers?
-13- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
DATAFLOW COMPUTERS
Advantages– Very high potential for parallelism– High throughput – Free from side-effect
Disadvantages– Time lost waiting for unneeded arguments– High control overhead– Difficult in manipulating data structures
-14- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Dataflow Machine (Manchester Dataflow
Computer)
From host
To host
First actual hardware implementation
I/OSwitch
MatchingUnit
OverflowUnit
TokenQueue
InstructionStore
func1 funck
Network
Token<data, tag, dest, marker>
Match<tag, dest>
-15- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
DATAFLOW REPRESENTATION
input d,e,f c0 = 0
for i from 1 to 4 do begin ai := di / ei
bi := ai * fi
ci := bi + ci-1
end
output a, b, c
/
d1 e1
*
+
f1
/
d2 e2
*
+
f2
/
d3 e3
*
+
f3
/
d4 e4
*
+
f4
a1
b1
a2
b2
a3
b3
a4
b4
c0 c4
c1 c2 c3
-16- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
EXECUTION ON CONTROL FLOW MACHINES
a1 b1 c1 a2 b2 c2 a4 b4 c4
Sequential execution on a uniprocessor in 24 cycles
Assume all the external inputs are available before entering do loop+ : 1 cycle, * : 2 cycles, / : 3 cycles,
How long will it take to execute this program on a dataflow computer with 4 processors?
-17- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
EXECUTION ON A DATAFLOW MACHINE
c1 c2 c3 c4a1
a2
a3
a4
b1
b2
b4
b3
Data-driven execution on a 4-processor dataflow computer in 9 cycles
Can we further reduce the execution time of this program ?
-18- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
PROBLEMS OF DATAFLOW COMPUTERS
Excessive copying of large data structures in dataflow operations– I-structure : a tagged memory unit for
overlapped usage by the producer and consumer
Retreat from pure dataflow approach (shared memory)
Handling complex data structures
Chain reaction control is difficult to implement– Complexity of matching store and memory
units Expose too much parallelism (?)
-19- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
DATA PARALLEL SYSTEMS
Programming model – Operations performed in parallel
on each element of data structure– Logically single thread of control,
performs sequential or parallel steps
– Conceptually, a processor associated with each data element
-20- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
DATA PARALLEL SYSTEMS (cont’d)
SIMD Architectural model– Array of many simple, cheap processors
with little memory eachProcessors don’t sequence through
instructions– Attached to a control processor that
issues instructions– Specialized and general communication,
cheap global synchronization
-21- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
VECTOR PROCESSOR
Instruction set includes operation on vectors as well as scalars
2 types of vector computers– Processor arrays– Pipelined vector processors
-22- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
PROCESSOR ARRAY
A sequential computer connected with a set of identical processing elements simultaneouls doing the same operation on different data. Eg CM-200
Data path
Processing element Data
memory
Inte
rcon
nect
ion
netw
ork
Processing element Data
memory
Processing element Data
memory
I/0
…
…Processor array
Instruction path
Program and Data Memory
CPU
I/O processor
Front-end computer
I/0
-23- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
PIPELINED VECTOR PROCESSOR
Stream vector from memory to the CPU
Use pipelined arithmetic units to manipulate data
Eg: Cray-1, Cyber-205
-24- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
MULTIPROCESSOR
Consists of many fully programmable processors each capable of executing its own program
Shared Address Space ArchitectureClassified into 2 types
– Uniform Memory Access (UMA) Multiprocessors
– Non-Uniform Memory Access (NUMA) Multiprocessors
-25- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
SHARED ADDRESS SPACE ARCHITECTURE
Virtual address spaces for a collection of processes communicating via shared addresses
Machine physical address space
Shared portion of address space
Private portion of address space
Pn private
Common physical addresses
P2 private
P1 private
P0 private
Store
Load
-26- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Uses a central switching mechanism to reach a centralized shared memory
All processors have equal access time to global memory
Tightly coupled system Problem: cache consistency
UMA MULTIPROCESSOR
P1
Switching mechanism
I/O
P2 … Pn
Memory banks
C1 C2 CnPi
Ci
Processor i
Cache i
-27- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
UMA MULTIPROCESSOR (cont’d)
Crossbar switching mechanism
Mem
Mem
Mem
Mem
Cache
P
I/OCache
P
I/O
-28- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
UMA MULTIPROCESSOR (cont’d)
Shared- Bus switching mechanism
Mem Mem Mem Mem
Cache
P
I/OCache
P
I/O
-29- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
UMA MULTIPROCESSOR (cont’d)
Packet-switched network
Network
Mem Mem
Cache
P
Cache
P
Cache
P
Mem
-30- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
NUMA MULTIPROCESSOR
Distributed shared memory combined by local memory of all processors
Memory access time depends on whether it is local to the processor
Caching shared (particularly nonlocal) data?
Network
Cache
P
Mem Cache
P
Mem
Cache
P
Mem Cache
P
Mem
Distributed Memory
-31- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
CURRENT TYPES OF MULTIPROCESSORS
PVP (Parallel Vector Processor)– A small number of proprietary vector processors
connected by a high-bandwidth crossbar switch SMP (Symmetric Multiprocessor)
– A small number of COTS microprocessors connected by a high-speed bus or crossbar switch
DSM (Distributed Shared Memory)– Similar to SMP– The memory is physically distributed among
nodes.
-32- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
PVP (Parallel Vector Processor)
VP VP
Crossbar Switch
VP
SM SM SM
VP : Vector Processor SM : Shared Memory
-33- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
SMP (Symmetric Multi-Processor)
P/C P/C
Bus or Crossbar Switch
P/C
SM SM SM
P/C : Microprocessor and Cache
-34- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
DSM (Distributed Shared Memory)
Custom-Designed Network
MB MB
P/C
LM
NIC
DIR
P/C
LM
NIC
DIR
DIR : Cache Directory
-35- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
MULTICOMPUTERS
Consists of many processors with their own memory No shared memory Processors interact via message passing loosely coupled
system
Message-Passing Interconnection
Network
MP
MP
MP
PM
PM
PM
PM
PM
…
P
P
…
M
M
…
…
-36- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
CURRENT TYPES OF MULTICOMPUTERS
MPP (Massively Parallel Processing)– Total number of processors > 1000
Cluster– Each node in system has less than 16
processors.Constellation
– Each node in system has more than 16 processors
-37- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
MPP (Massively Parallel Processing)
P/C
LM
NIC
MB
P/C
LM
NIC
MB
Custom-Designed Network
MB : Memory Bus NIC : Network Interface Circuitry
-38- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Cluster
Commodity Network (Ethernet, ATM, Myrinet, VIA)
MB MB
P/C
M
NIC
P/C
M
Bridge Bridge
LD LD
NIC
IOB IOB
LD : Local Disk IOB : I/O Bus
-39- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
CONSTELLATION
P/CP/C
SM SMNICLD
Hub
Custom or Commodity Network
>= 16
IOC
P/CP/C
SM SMNICLD
Hub
>= 16
IOC
IOC : I/O Controller
-40- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Trend in Parallel Computer Architectures
0
50
100
150
200
250
300
350
400
1997 1998 1999 2000 2001 2002 Years
Num
ber
of H
PC
s
MPPs Constellations Clusters SMPs
-41- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
TOWARD ARCHITECTURAL CONVERGENCE
Evolution and role of software have blurred boundary– Send/recv supported on SAS machine via buffers– Can construct global address space on MP
Hardware organization converging too.– Tighter NI integration for MP (low latency, higher bandwidth)– At lower level, hardware SAS passes hardware messages.– Nodes connected by general network and communication
assists. Cluster of workstations/SMPs
– Emergence of fast SAN (System Area Networks)
-42- Khoa Coâng Ngheä Thoâng Tin – Ñaïi Hoïc Baùch Khoa Tp.HCM
Converged Architecture of Current
Supercomputers
Interconnection NetworkInterconnection Network
MemoryMemory
PP PP PP PP
MemoryMemory
PP PP PP PP
MemoryMemory
PP PP PP PP
MemoryMemory
PP PP PP PP
MultiprocessorsMultiprocessorsMultiprocessorsMultiprocessors
Recommended