Master Program (Laurea Magistrale) in Computer Science and Networking
High Performance Computing Systems and Enabling Platforms
Marco Vanneschi
Precourse in Computer Architecture Fundamentals
2.
Firmware Level
Firmware level
• At this level, the system is viewed as a collection of cooperating modules called PROCESSING UNITS (simply: units).
• Each Processing Unit is
– autonomous
• has its own control, i.e. it has self-control capability, i.e. it is an active computational entity
– described by a sequential program, called microprogram.
• Cooperation is realized through COMMUNICATIONS
– Communication channels implemented by physical links and a firmware protocol.
• Parallelism among units.
MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms 2
[Figure: units U1, U2, …, Uj, …, Un cooperating through communication channels; each unit Uj has its own Control Part j and Operating Part j.]
Examples
[Figure: a uniprocessor architecture: Main Memory, Cache Memory, Memory Management Unit, Processor, Interrupt Arbiter, I/O1 … I/Ok; the CPU groups several of these units on the same chip.]
A uniprocessor architecture. In turn, the CPU is a collection of communicating units (on the same chip).
In turn, the Cache Memory can be implemented as a collection of parallel units: instruction and data cache; primary, secondary, tertiary… cache.
In turn, the Main Memory can be implemented as a collection of parallel units: modular memory, interleaved memory, buffer memory, or …
In turn, the Processor can be implemented as a (large) collection of parallel, pipelined units.
In turn, an advanced I/O unit can be a parallel architecture (vector / SIMD coprocessor / GPU / …), or a powerful multiprocessor-based network interface, or ….
Firmware interpretation of assembly language
[Figure: the level stack: Applications, Processes, Assembler, Firmware, Hardware. At the Firmware level: Uniprocessor (Instruction Level Parallelism), Shared Memory Multiprocessor (SMP, NUMA, …), Distributed Memory (Cluster, MPP, …).]
• The true architectural level.
• Microcoded interpretation of assembler instructions.
• This level is fundamental (it cannot be eliminated).
• Physical resources: combinatorial and sequential circuit components, physical links.
The firmware interpretation of any assembly language instruction has a decentralized organization, even for simple uniprocessor systems.
The interpreter functionalities are partitioned among the various processing units (belonging to the CPU, M, I/O subsystems), each unit having its own Control Part and its own microprogram.
Cooperation of units is through explicit message passing.
Processing Unit model: Control Part + Operating Part
PC and PO are automata, implemented by synchronous sequential networks (sequential circuits).
Clock cycle: the time interval t for executing a (any) microinstruction. Synchronous model of computation.
[Figure: Control Part (PC) and Operating Part (PO); PO sends the condition variables {x} to PC, PC sends the control variables g = {a} ∪ {b} to PO; external inputs and outputs attach to PO; both parts receive the clock signal.]
Example of a simple unit design
Specification: IN: OUT = number_of_occurrences (IN, A)

Pseudo-code of interpreter:
  while (true) do
    B = IN; C = 0;
    for (J = 0; J < N; J++)
      if (A[J] == B) then C = C + 1;
    OUT = C

[Figure: unit U with a 32-bit input stream IN and a 32-bit output stream OUT; the integer array A[N] is implemented as a register array (RAM/ROM).]
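The pseudo-code above can be rendered as an executable sketch. The array A and the input stream values below are made-up test data; plain Python stands in for the firmware-level behaviour.

```python
# A minimal Python rendering of the interpreter pseudo-code above.

def number_of_occurrences(value, A):
    """Count how many elements of A are equal to value (one service of U)."""
    C = 0
    for j in range(len(A)):          # for (J = 0; J < N; J++)
        if A[j] == value:            # if (A[J] == B) then C = C + 1
            C += 1
    return C

A = [3, 7, 3, 1, 3]                  # register array A[N], N = 5 (made-up)
stream_in = [3, 7, 9]                # IN stream (made-up)
stream_out = [number_of_occurrences(b, A) for b in stream_in]
print(stream_out)                    # [3, 1, 0]
```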
Approach to processing unit design
• PO structure is composed of
  – Registers, to implement variables:
    • independent registers,
    • register arrays: ROM/RAM memories
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
Synchronized registers
when clock do R_output = R_input
In PO, the clock signal is AND-ed with a control variable (b), in order to enable register writing explicitly by PC.
[Figure: register R with input R_input and output R_output = R_present_state; the clock (pulse signal) is gated with the Register_writing enable.]
Register output and state remain STABLE during a clock cycle.
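As an illustrative model (not part of the slides), the enabled register can be sketched as follows; the `Register` class and its field names are assumptions of the sketch.

```python
# Sketch of a synchronized register with write-enable: on a clock event the
# register copies its input to its output only if the enable bit b (from PC)
# is 1; otherwise output and state stay stable for the whole cycle.

class Register:
    def __init__(self, value=0):
        self.output = value      # R_output = R_present_state
        self.input = value       # R_input

    def clock(self, b):
        """One clock pulse; b is the control variable AND-ed with the clock."""
        if b:
            self.output = self.input

r = Register(0)
r.input = 42
r.clock(b=0)     # write not enabled: state unchanged
print(r.output)  # 0
r.clock(b=1)     # write enabled: R_output = R_input
print(r.output)  # 42
```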
Example of PO structure
[Figure: the PO data path: 32-bit registers B, C, J, OUT and the 1-bit register Z, each with a write-enable signal (bB, bC, bJ, bOUT, bZ) from PC; the register array A[N] (RAM/ROM, N words, 32 bits/word); the arithmetic-logic combinatorial circuits ALU1 (operations −, +1) and ALU2 (+1); multiplexers controlled from PC (aC, aJ, aAlu1, aAlu2); condition variables x0 = J0 and x1 = Z sent to PC.]
Approach to processing unit design
• PO structure is composed of
  – Registers, to implement variables:
    • independent registers,
    • register arrays: ROM/RAM memories
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations.
  – Example: A + B → C
  – i.e., the contents of registers A and B are added, and the result is written into register C.
Example of micro-operation
[Figure: the data path activated in the PO structure for the execution of zero (A[J] – B) → Z, with the corresponding values of the control variables.]
Executed in one clock cycle: all the interested combinatorial circuits become stable during the clock cycle.
Approach to processing unit design
• PO structure is composed of
  – Registers, to implement variables:
    • independent registers,
    • register arrays: ROM/RAM memories
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations.
  – Example: A + B → C
  – i.e., the contents of registers A and B are added, and the result is written into register C.
• Parallelism inside micro-operations.
  – Example: A + B → C, D – E → D
Example of parallel micro-operation
[Figure: the data path activated in the PO structure for the execution of J + 1 → J, C + 1 → C, with the corresponding values of the control variables.]
These register-transfer operations are logically independent, and they can actually be executed in parallel if proper PO resources are provided (e.g., two ALUs). Both operations are executed during the same clock cycle.
Approach to processing unit design
• PO structure is composed of
  – Registers, to implement variables:
    • independent registers,
    • register arrays: ROM/RAM memories
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations. Example: A + B → C
  – i.e., the contents of registers A and B are added, and the result is written into register C.
• Parallelism inside micro-operations. Example: A + B → C, D – E → F
• Conditional microinstructions: if … then … else, switch case …
• PC: at each clock cycle it interacts with PO in order to control the execution of the current microinstruction.
Example of microinstruction

i. if (Asign = 0)
     then A + B → A, J + 1 → J, goto j
     else A – B → B, goto k
j. …
…
k. …
Annotations: i is the current microinstruction label; (Asign = 0) is a logical condition on one or more condition variables (Asign denotes the most significant bit of register A); A + B → A, J + 1 → J is a (parallel) micro-operation; j and k are labels of the next microinstruction.
Control Part (PC) is an automaton, whose state graph corresponds to the structure of the microprogram:
[State graph with nodes i, j, k.]
Asign = 0: execute A + B → A, J + 1 → J in PO, and goto j
Asign = 1: execute A – B → B in PO, and goto k
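A hypothetical, clock-abstracted simulation of this microinstruction: PC evaluates the condition variable Asign and selects both the (parallel) micro-operation executed by PO and the next label. The 32-bit width and masking convention are assumptions of the sketch.

```python
# Simulation of microinstruction i:
#   if (Asign = 0) then A + B -> A, J + 1 -> J, goto j
#                  else A - B -> B, goto k

def microinstruction_i(state):
    """Execute microinstruction i on the register state; return next label."""
    A, B, J = state['A'], state['B'], state['J']
    Asign = (A >> 31) & 1                      # most significant bit of A
    if Asign == 0:
        # parallel micro-operation, one clock cycle: A + B -> A, J + 1 -> J
        state['A'] = (A + B) & 0xFFFFFFFF
        state['J'] = (J + 1) & 0xFFFFFFFF
        return 'j'
    else:
        # micro-operation: A - B -> B
        state['B'] = (A - B) & 0xFFFFFFFF
        return 'k'

state = {'A': 5, 'B': 3, 'J': 0}
print(microinstruction_i(state))   # j   (Asign = 0)
print(state)                       # {'A': 8, 'B': 3, 'J': 1}
```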
Approach to processing unit design
• PO structure is composed of
  – Registers, to implement variables: independent registers; register arrays: ROM/RAM memories
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations. Example: A + B → C, i.e., the contents of registers A and B are added, and the result is written into register C.
• Parallelism inside micro-operations. Example: A + B → C, D – E → F
• Conditional microinstructions: if … then … else, switch case …
• PC: at each clock cycle it interacts with PO in order to control the execution of the current microinstruction.
• Each microinstruction is executed in a clock cycle.
Generation of proper values of control variables.
Clock cycle: informal meaning
[Figure: timing of PC and PO across clock cycles; the pulse width allows the stabilization of register contents, and during each cycle PC generates the values of the control variables. Example: A + B → C (A, B, C: registers of PO) is executed as input_C = A + B, followed by input_C → C at the clock pulse; A + B → C, D + 1 → D are executed in parallel in the same clock cycle as input_C = A + B and input_D = D + 1, followed by C = input_C, D = input_D.]
Clock frequency f = 1/t; e.g. t = 0.25 nsec, f = 4 GHz.
Clock cycle t = maximum stabilization delay time for the execution of a (any) microinstruction, i.e. for the stabilization of the inputs of all the PC and PO registers.
Clock cycle: formally
[Figure: within a clock cycle, PO's output function (delay TωPO) produces the condition variables {x} for PC; PC's output function (TωPC) produces the control variables {g} for PO; the state transition functions (TσPO, TσPC) then stabilize the register inputs; d is the pulse width.]

clock cycle: t = TωPO + max (TωPC + TσPO, TσPC) + d

ωA: output function of automaton A
σA: state transition function of automaton A
TF: stabilization delay of the combinatorial network implementing function F
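The formula can be illustrated numerically; the delay values below (in nanoseconds) are invented for the example, chosen so that t matches the 0.25 nsec / 4 GHz figure used earlier.

```python
# Clock-cycle formula: t = TwPO + max(TwPC + TsPO, TsPC) + d (pulse width).
# All delay values are hypothetical, in nanoseconds.

def clock_cycle(T_wPO, T_wPC, T_sPO, T_sPC, d):
    return T_wPO + max(T_wPC + T_sPO, T_sPC) + d

t = clock_cycle(T_wPO=0.08, T_wPC=0.05, T_sPO=0.07, T_sPC=0.06, d=0.05)
print(round(t, 2))   # 0.25  ->  f = 1/t = 4 GHz
```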
Example of a simple unit design
Specification: IN: OUT = number_of_occurrences (IN, A)

Pseudo-code of interpreter:
  while (true) do
    B = IN; C = 0;
    for (J = 0; J < N; J++)
      if (A[J] == B) then C = C + 1;
    OUT = C

Microprogram (= interpreter implementation):
0. IN → B, 0 → C, 0 → J, goto 1
1. if (Jmost = 0) then zero (A[Jmodule] – B) → Z, goto 2
   else C → OUT, goto 0
2. if (Z = 0) then J + 1 → J, goto 1
   else J + 1 → J, C + 1 → C, goto 1
// to be completed with synchronized communications //

Service time of U: Tcalc = k t = (2 + 2N) t ≈ 2N t

Control Part state graph (automaton): states 0, 1, 2; per task, the arc 0 → 1 is taken 1 time, the arcs 1 → 2 (Jmost = 0) and 2 → 1 are taken N times each, and the arc 1 → 0 (Jmost = 1) is taken 1 time.

[Figure: unit U with a 32-bit input stream IN and a 32-bit output stream OUT; the integer array A[N] is implemented as a register array (RAM/ROM).]
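The cycle count k = 2 + 2N can be checked with a small simulation that charges one clock cycle per executed microinstruction; the test array is made-up.

```python
# Counting clock cycles of the microprogram above: microinstruction 0 runs
# once, microinstruction 1 runs N + 1 times, microinstruction 2 runs N times,
# hence k = 2 + 2N cycles per task.

def tcalc_cycles(A, b):
    cycles = 0
    B, C, J = b, 0, 0            # microinstruction 0: IN -> B, 0 -> C, 0 -> J
    cycles += 1
    N = len(A)
    while True:
        cycles += 1              # microinstruction 1
        if J < N:                # Jmost = 0
            Z = 1 if A[J] == B else 0   # zero(A[J] - B) -> Z
            cycles += 1          # microinstruction 2: J + 1 -> J (and C + 1 -> C)
            J += 1
            if Z:
                C += 1
        else:                    # Jmost = 1: C -> OUT, goto 0
            return C, cycles

A = [3, 7, 3, 1]                 # N = 4 (made-up data)
out, k = tcalc_cycles(A, 3)
print(out, k)                    # 2 10   (k = 2 + 2N)
```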
PO structure
[Figure: the PO data path: 32-bit registers B, C, J, OUT and the 1-bit register Z, each with a write-enable signal from PC; the register array A[N] (RAM/ROM, N words, 32 bits/word); the arithmetic-logic combinatorial circuits ALU1 (−, +1) and ALU2 (+1); multiplexers controlled from PC; condition variables x0 = J0 and x1 = Z sent to PC.]
PO structure
[Figure: the data path activated for the execution of J + 1 → J, C + 1 → C, with the corresponding values of the control variables.]
PO structure
[Figure: the data path activated for the execution of zero (A[J] – B) → Z, with the corresponding values of the control variables.]
Service time, and other performance parameters
• Computations on streams.
• Stream: (possibly infinite) sequence of values of the same type, also called tasks.
• Service time: average time interval between the beginning of processing of two consecutive tasks, i.e. between two consecutive services.
• Latency: average duration of the processing of a single task (from input acquisition to result generation): calculation time.
  – Service time is equal to latency for sequential computations only (a single module, or unit).
• Processing Bandwidth = 1 / service_time: average number of processed tasks per second (e.g. images per second, operations per second, instructions per second, bits transmitted per second, …), also called throughput for generic applications, or performance for CPUs (instructions or floating-point operations per second).
• Completion time, for a finite-length stream of m tasks: time interval needed to fully process all the tasks. For large m and parallel computations, the following approximate relation often holds:
    completion_time ≈ m · service_time
  The equality is rigorous for sequential computations.
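A back-of-the-envelope rendering of these parameters; the service time of 2 ns per task and the stream length m = 1000 are hypothetical.

```python
# Performance parameters on a stream: bandwidth = 1 / service_time, and the
# approximate relation completion_time ~ m * service_time.

service_time_ns = 2                  # ns between two consecutive services
bandwidth = 1e9 / service_time_ns    # tasks per second
m = 1000                             # finite stream length (hypothetical)
completion_ns = m * service_time_ns  # approximate completion time, in ns

print(bandwidth)        # 500000000.0  tasks/s
print(completion_ns)    # 2000         ns
```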
Communications at the firmware level
• Dedicated links (one-to-one, unidirectional).
• Shared link: bus (many-to-many, bidirectional). It can support one-to-one, one-to-many, and many-to-one communications.
• In general, for computer structures: parallel links (e.g. one or more words per link).
• Valid for: uniprocessor computers, shared memory multiprocessors, some multicomputers with distributed memory (MPP).
• Not valid for: distributed architectures with standard communication networks, some kinds of simple I/O.
Basic firmware structure for dedicated links
[Figure: Usource connected to Udestination through an output interface, a LINK, and an input interface.]

Usource::
  while (true) do
    < produce message MSG >;
    < send MSG to Udestination >

Udestination::
  while (true) do
    < receive MSG from Usource >;
    < process MSG >
Ready – acknowledgement protocol
[Figure: U1 with output register OUT and U2 with input register IN, connected by the MSG link (n bits), the RDY line (1 bit), and the ACK line (1 bit).]

< send MSG > ::
  wait ACK;
  write MSG into OUT, signal RDY, …

< receive MSG > ::
  wait RDY;
  utilize IN, signal ACK, …

Asynchronous communication: asynchrony degree = 1
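The protocol can be sketched as a sequential model with asynchrony degree 1: the sender may write OUT only when ACK is up, and the receiver may read IN only when RDY is up. The `Channel` class is an illustrative abstraction, not the firmware structure itself.

```python
# Ready-acknowledgement protocol with asynchrony degree 1: the sender can be
# at most one message ahead of the receiver. Sender and receiver are
# interleaved by hand to keep the sketch deterministic.

class Channel:
    def __init__(self):
        self.OUT = None
        self.RDY = 0
        self.ACK = 1            # initially the sender is allowed to write

    def send(self, msg):
        assert self.ACK == 1, "sender must wait for ACK"
        self.OUT, self.RDY, self.ACK = msg, 1, 0

    def receive(self):
        assert self.RDY == 1, "receiver must wait for RDY"
        msg = self.OUT
        self.RDY, self.ACK = 0, 1
        return msg

ch = Channel()
ch.send("MSG0")
print(ch.receive())     # MSG0
ch.send("MSG1")         # allowed again only after the ACK
print(ch.receive())     # MSG1
```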
Implementation
[Figure: U1's output register OUT connected to U2's input register IN through the MSG link (n bits); RDY and ACK lines (1 bit each) with interface registers RDYOUT, ACKOUT on the U1 side and RDYIN, ACKIN on the U2 side, written through the enables bRDYOUT, bACKOUT, bRDYIN, bACKIN.]

< send MSG > ::
  wait until ACKOUT;
  MSG → OUT, set RDYOUT, reset ACKOUT

< receive MSG > ::
  wait until RDYIN;
  f(IN, …) → …, set ACKIN, reset RDYIN

Level-transition interface: detects a level transition on the input line, provided that it occurs when S = Y.
[Figure: the RDY line is sampled into S; an exor of S and the modulo-2 counter Y, evaluated at each clock, produces RDYIN.]
set X: complements the content of X (a level transition is produced onto the output link);
reset X: forces the output of X to zero (by forcing element Y to become equal to S).
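An illustrative model (class and method names are assumptions) of the transition-based signalling: S tracks the line level, Y is the modulo-2 counter, and RDYIN = S xor Y detects that a transition has occurred and not yet been consumed.

```python
# Level-transition interface: the RDY line carries level transitions rather
# than levels; RDYIN = S xor Y is 1 exactly when a transition has arrived
# (set) and has not been consumed (reset) yet.

class TransitionInterface:
    def __init__(self):
        self.S = 0   # sampled line level
        self.Y = 0   # modulo-2 counter

    def rdyin(self):
        return self.S ^ self.Y

    def set_line(self):       # sender side: complement the line level
        self.S ^= 1

    def reset(self):          # receiver side: force RDYIN back to 0
        self.Y = self.S

i = TransitionInterface()
print(i.rdyin())   # 0: no transition yet
i.set_line()
print(i.rdyin())   # 1: transition detected
i.reset()
print(i.rdyin())   # 0: event consumed
```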
Example of a simple unit design with communication
Specification: IN: OUT = number_of_occurrences (IN, A)

Pseudo-code of interpreter:
  while (true) do
    receive (IN, B); C = 0;
    for (J = 0; J < N; J++)
      if (A[J] == B) then C = C + 1;
    send (OUT, C)

[Figure: unit U with 32-bit links IN and OUT, lines RDYIN, ACKIN, RDYOUT, ACKOUT, and the integer array A[N] implemented as a register array (RAM/ROM).]

Microprogram (= interpreter implementation):
0. if (RDYIN = 0) then nop, goto 0
   else reset RDYIN, set ACKIN, IN → B, 0 → C, 0 → J, goto 1
1. case (Jmost ACKOUT = 0 –) do zero (A[Jmodule] – B) → Z, goto 2
   case (Jmost ACKOUT = 1 0) do nop, goto 1
   case (Jmost ACKOUT = 1 1) do reset ACKOUT, set RDYOUT, C → OUT, goto 0
2. if (Z = 0) then J + 1 → J, goto 1
   else J + 1 → J, C + 1 → C, goto 1
Impact of communication protocol on performance parameters
Control Part state graph (automaton): states 0, 1, 2; self-loop on 0 when RDYIN = 0; arc 0 → 1 when RDYIN = 1 (1 time per task); arcs 1 → 2 (Jmost = 0) and 2 → 1 (N times each); self-loop on 1 when Jmost = 1 and ACKOUT = 0; arc 1 → 0 when Jmost = 1 and ACKOUT = 1 (1 time).

Synchronization implies wait conditions (wait RDYIN, wait ACKOUT), corresponding to arcs with the same source and destination node (stable internal states) in the state graph.

• In the evaluation of the internal calculation time of U, these wait conditions are not to be taken into account.
• That is, we are interested in evaluating the processing capabilities of U independently of the current situation of the partner units: as if the source and destination units were always ready to communicate.
• On the other hand, the communication protocol has "no cost", i.e. no overhead.
• Thus, the calculation time of U is again Tcalc = 2Nt.
• However, communication might have an impact on the service time.
Communication: performance parameters
• Because of communication impact, we distinguish between:
  – Ideal service time = internal calculation time
  – Effective service time, or simply service time
• Overhead of the RDY-ACK protocol at the firmware level, for dedicated links: null for the calculation time (ideal service time)!
  – All the required operations for the RDY-ACK protocol are executed in a single clock cycle, in parallel with the operations of the true computation.
• Communication latency: time needed to complete a communication, i.e. for the ACK reception.
• Service time: considering the effect of communication, it depends on the calculation time and on the communication latency:
  – once a send operation has been executed, the sender can execute a new send operation after the communication latency time interval (not before).
Communication latency and impact on service time
[Figure: sender and receiver timelines with RDY and ACK signals; clock cycles for calculation only alternate with clock cycles for calculation and communication; Ttr is the transmission latency (link only).]

Communication Latency: Lcom = 2 (Ttr + t)
Tcom = communication time NOT overlapped with (i.e. not masked by) internal calculation
Service time: T = Tcalc + Tcom
Tcalc ≥ Lcom ⇒ Tcom = 0
“Coarse grain” and “fine grain” computations
• Two possible situations: Tcalc ≥ Lcom or Tcalc < Lcom

Tcalc ≥ Lcom (Coarse Grain): Tcom = 0, service time T = Tcalc
Tcalc < Lcom (Fine Grain): Tcom > 0, service time T = Lcom

In general: T = Tcalc + Tcom = max (Tcalc, Lcom)
In the example: Tcalc = 2Nt. Let Ttr = 9t; then Lcom = 2 (t + Ttr) = 20t.

For N ≥ 10: Tcalc ≥ Lcom. In this case Tcom = 0 and the service time is T = Tcalc = 2Nt, i.e. a "coarse grain" calculation (wrt communications). However, for relatively low N, provided that N ≥ 10, even a rather "fine grain" computation (e.g. a few tens of clock cycles) is able to fully mask communications.

For N < 10: Tcalc < Lcom, Tcom > 0, and the service time is T = Lcom = 20t, i.e. a "too fine" grain calculation.
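Using the relation T = max(Tcalc, Lcom) with the example's numbers (Ttr = 9t, hence Lcom = 20t), expressed in clock cycles:

```python
# Service time with communication overlap: T = max(Tcalc, Lcom).
# Tcalc = 2N cycles, Lcom = 2(1 + Ttr) = 20 cycles with Ttr = 9 cycles.

def service_time_cycles(N, Ttr=9):
    Tcalc = 2 * N                 # internal calculation time, in cycles
    Lcom = 2 * (1 + Ttr)          # communication latency, in cycles
    return max(Tcalc, Lcom)

print(service_time_cycles(50))    # 100: coarse grain, T = Tcalc
print(service_time_cycles(10))    # 20:  boundary case, Tcalc = Lcom
print(service_time_cycles(4))     # 20:  too fine grain, T = Lcom
```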
Communication latency and impact on service time
Comm. Latency: Lcom = 2 (Ttr + t)
Service time: T = Tcalc + Tcom; Tcalc ≥ Lcom ⇒ Tcom = 0, efficiency e ≈ 1, ideal processing bandwidth (throughput).

• Impact of Ttr on the service time T is significant
  • e.g. memory access time = response time of "external" memory to processor requests ("external" = outside the CPU chip)
• Inside the same chip: Ttr ≈ 0,
  • e.g. between processor, MMU, cache
  • except for on-chip interconnection structures with "long wires" (e.g. buses, rings, meshes, …)
• With Ttr ≈ 0, very "fine grain" computations are possible without communication penalty on the service time: Lcom = 2t; if Tcalc ≥ 2t then T = Tcalc
• Important application in Instruction Level Parallelism processors and in Interconnection Network switch units.
Request-reply communication
Example: a simple cooperation between a processor P and a memory unit M (read or write request from P to M, reply from M to P).
[Figure: timeline of P and M: the request crosses the link (transmission latency Ttr), M processes it (memory clock cycle tM), and the reply crosses the link back.]

P-M request-reply latency: ta = 2 (t + Ttr) + tM. This is the memory access time as "seen" by the processor.
Integrated hardware technologies
• Integration on silicon semiconductor CHIPS.
• Chip area vs clock cycle:
  – Each logical HW component takes up a given area on silicon (currently: features of the order of nanometers).
  – Increasing the number of components per chip, i.e. increasing chip area, is positive provided that components are intensively "packed".
  – Relatively long links on chip are the main source of inefficiency.
  – Some macro-components are more suitable for intensive packing (notably: memories), others are very critical (e.g. large combinatorial networks realized by AND-OR gates).
  – Pin count problem: increasing the number of interfaces increases the chip perimeter linearly, thus the chip area increases quadratically. Many interfaces (e.g. over 10² – 10³ pins) are motivated only if the area increase is properly exploited by a quadratic increase in intensively packed HW components.
• Technology of CPUs vs technology of memories and links:
  – CPU clock cycle: t ≈ 10⁰ nsec (frequency: few GHz)
  – Static Main Memory clock cycle: tM ≈ 10¹ – 10² t
  – Dynamic Main Memory clock cycle: tM ≈ 10² – 10⁴ t
  – Inter-chip link transmission latency: Ttr ≈ 10⁰ – 10² t
  – Thus, ta is (one or) more orders of magnitude greater than the CPU clock cycle.
The processor-memory gap (Von Neumann bottleneck)
• For every (known) technology, the ratio ta/t increases at every technological improvement:
  – the frequency increase of the CPU clock is greater than the frequency increase of the memory clock and of the links (inverse of the unit latency of links).
• This is the central problem of computer architecture (yesterday, today, tomorrow).
  – SOLUTIONS ARE BASED ON MEMORY HIERARCHIES AND ON PARALLELISM.
• "Moore's law" (the CPU frequency doubles every 1.5 years) was valid until 2004 (we'll see why). Memory and link technologies have "stopped" too.
  – Multi-core evolution / revolution.
  – Moore's law is now applied to the number of cores (CPUs) in multi-core and many-core chips.
  – What about programming? One main motivation of the SPA and SPM courses.
Interconnection structures based on dedicated links: first examples
[Figure: a tree interconnection: E units distribute the input stream down to units U1 … U4, and C units collect their results into the output stream.]

Trees: interconnection of a single input stream and of a single output stream to a collection of n independent units.
Communication Latency = O(lg n)

Rings / Tori: Communication Latency = O(n)
[Figure: a ring of units U1, U2, U3, U4.]
Interconnection structures based on dedicated links: first examples

Meshes (possibly toroidal):
[Figure: a 3 × 3 mesh of units U.]
Communication Latency = O(n^1/2)
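The order-of-magnitude latencies of the three structures can be compared numerically (constants omitted; n is the number of units):

```python
# Communication latency orders for the interconnection structures above:
# trees O(lg n), rings/tori O(n), meshes O(n^1/2). Constants are omitted.
import math

def tree_latency(n):  return math.ceil(math.log2(n))   # O(lg n)
def ring_latency(n):  return n                         # O(n)
def mesh_latency(n):  return math.isqrt(n)             # O(n^1/2)

n = 64
print(tree_latency(n), ring_latency(n), mesh_latency(n))   # 6 64 8
```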
Interconnection structures based on dedicated links

• In general, all the most interesting interconnection structures for
  – shared memory multiprocessors
  – distributed memory multicomputers, clusters, MPP
  are "limited degree" networks
  – based on dedicated links
  – with a limited number of links per unit: PIN COUNT problem
• Examples (Course Part 2):
  – n-dimension cubes
  – n-dimension butterflies
  – n-dimension Trees / Fat Trees
  – rings and multi-rings
• Currently the "best" commercial networks belong to these classes:
  – Infiniband
  – Myrinet
  – …
• Trend: on-chip optical interconnection networks
(Traditional, old-style) Bus
[Figure: sender behaviour and receiver behaviour on the bus.]
Shared, bidirectional link. At most one message at a time can be transmitted.
Arbitration mechanisms must be provided.
Serious problems of bandwidth (contention) for bus-based parallel structures.
Arbiter mechanisms (for bus writing)
• Independent Request arbiters
• Daisy Chaining
• Polling (similar)
Decentralized arbiters
• Token ring
• Non-deterministic with collision detection
Firmware: summary
[Figure: the level stack: Applications, Processes, Assembler, Firmware, Hardware. At the Firmware level: Uniprocessor (Instruction Level Parallelism), Shared Memory Multiprocessor (SMP, NUMA, …), Distributed Memory (Cluster, MPP, …).]

• The true architectural level.
• Microcoded interpretation of assembler instructions.
• This level is fundamental (it cannot be eliminated).
• Physical resources: combinatorial and sequential circuit components, physical links.
• Decentralized control of the firmware interpreter
• Synchronous computation model
• Clock cycle
• Parallelism inside microinstructions
• Clear cost model for calculation and for communication
• Efficient communications can be implemented on dedicated links
• Communication-internal calculation overlapping