Master Program (Laurea Magistrale) in Computer Science and Networking
High Performance Computing Systems and Enabling Platforms
Marco Vanneschi
Precourse in Computer Architecture Fundamentals
2.
Firmware Level
Firmware level
• At this level, the system is viewed as a collection of cooperating modules called PROCESSING UNITS (simply: units).
• Each Processing Unit is
– autonomous
• has its own control, i.e. it has self-control capability, i.e. it is an active computational entity
– described by a sequential program, called microprogram.
• Cooperation is realized through COMMUNICATIONS
– Communication channels implemented by physical links and a firmware protocol.
• Parallelism among units.
MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms 2
[Figure: units U1, U2, …, Uj, …, Un cooperating through communication channels; each unit Uj has its own Control Part j and Operating Part j.]
Examples
[Figure: a uniprocessor architecture: Main Memory, Cache Memory, Memory Management Unit, Processor, Interrupt Arbiter, I/O1 … I/Ok; the CPU groups several of these units on the same chip.]
A uniprocessor architecture. In turn, the CPU is a collection of communicating units (on the same chip).
In turn, the Cache Memory can be implemented as a collection of parallel units: instruction and data cache; primary, secondary, tertiary… cache.
In turn, the Main Memory can be implemented as a collection of parallel units: modular memory, interleaved memory, buffer memory, or …
In turn, the Processor can be implemented as a (large) collection of parallel, pipelined units.
In turn, an advanced I/O unit can be a parallel architecture (vector / SIMD coprocessor / GPU / …), or a powerful multiprocessor-based network interface, or ….
Firmware interpretation of assembly language
[Figure: the level stack: Applications, Processes, Assembler, Firmware, Hardware. At the Firmware level: Uniprocessor (Instruction Level Parallelism), Shared Memory Multiprocessor (SMP, NUMA, …), Distributed Memory (Cluster, MPP, …).]
• The true architectural level.
• Microcoded interpretation of assembler instructions.
• This level is fundamental (it cannot be eliminated).
• Physical resources: combinatorial and sequential circuit components, physical links.
The firmware interpretation of any assembly language instruction has a decentralized organization, even for simple uniprocessor systems.
The interpreter functionalities are partitioned among the various processing units (belonging to the CPU, M, I/O subsystems), each unit having its own Control Part and its own microprogram.
Cooperation of units is through explicit message passing.
Processing Unit model: Control Part + Operating Part
PC and PO are automata, implemented by synchronous sequential networks (sequential circuits).
Clock cycle: the time interval t for executing a (any) microinstruction. Synchronous model of computation.
[Figure: Control Part (PC) and Operating Part (PO); PO sends the condition variables {x} to PC, PC sends the control variables g = {a} ∪ {b} to PO; external inputs and outputs attach to PO; both parts receive the clock signal.]
Example of a simple unit design
Specification: IN: OUT = number_of_occurrences (IN, A)

Pseudo-code of interpreter:
  while (true) do
    B = IN; C = 0;
    for (J = 0; J < N; J++)
      if (A[J] == B) then C = C + 1;
    OUT = C

[Figure: unit U with a 32-bit input stream IN and a 32-bit output stream OUT; the integer array A[N] is implemented as a register array (RAM/ROM).]
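The pseudo-code above can be rendered as an executable sketch. The array A and the input stream values below are made-up test data; plain Python stands in for the firmware-level behaviour.

```python
# A minimal Python rendering of the interpreter pseudo-code above.

def number_of_occurrences(value, A):
    """Count how many elements of A are equal to value (one service of U)."""
    C = 0
    for j in range(len(A)):          # for (J = 0; J < N; J++)
        if A[j] == value:            # if (A[J] == B) then C = C + 1
            C += 1
    return C

A = [3, 7, 3, 1, 3]                  # register array A[N], N = 5 (made-up)
stream_in = [3, 7, 9]                # IN stream (made-up)
stream_out = [number_of_occurrences(b, A) for b in stream_in]
print(stream_out)                    # [3, 1, 0]
```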
Approach to processing unit design
• PO structure is composed of
  – Registers, to implement variables:
    • independent registers,
    • register arrays: ROM/RAM memories
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
Synchronized registers
when clock do R_output = R_input
In PO, the clock signal is AND-ed with a control variable (b), in order to enable register writing explicitly by PC.
[Figure: register R with input R_input and output R_output = R_present_state; the clock (pulse signal) is gated with the Register_writing enable.]
Register output and state remain STABLE during a clock cycle.
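As an illustrative model (not part of the slides), the enabled register can be sketched as follows; the `Register` class and its field names are assumptions of the sketch.

```python
# Sketch of a synchronized register with write-enable: on a clock event the
# register copies its input to its output only if the enable bit b (from PC)
# is 1; otherwise output and state stay stable for the whole cycle.

class Register:
    def __init__(self, value=0):
        self.output = value      # R_output = R_present_state
        self.input = value       # R_input

    def clock(self, b):
        """One clock pulse; b is the control variable AND-ed with the clock."""
        if b:
            self.output = self.input

r = Register(0)
r.input = 42
r.clock(b=0)     # write not enabled: state unchanged
print(r.output)  # 0
r.clock(b=1)     # write enabled: R_output = R_input
print(r.output)  # 42
```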
Example of PO structure
[Figure: the PO data path: 32-bit registers B, C, J, OUT and the 1-bit register Z, each with a write-enable signal (bB, bC, bJ, bOUT, bZ) from PC; the register array A[N] (RAM/ROM, N words, 32 bits/word); the arithmetic-logic combinatorial circuits ALU1 (operations −, +1) and ALU2 (+1); multiplexers controlled from PC (aC, aJ, aAlu1, aAlu2); condition variables x0 = J0 and x1 = Z sent to PC.]
Approach to processing unit design
• PO structure is composed of
  – Registers, to implement variables:
    • independent registers,
    • register arrays: ROM/RAM memories
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations.
  – Example: A + B → C
  – i.e., the contents of registers A and B are added, and the result is written into register C.
Example of micro-operation
[Figure: the data path activated in the PO structure for the execution of zero (A[J] – B) → Z, with the corresponding values of the control variables.]
Executed in one clock cycle: all the interested combinatorial circuits become stable during the clock cycle.
Approach to processing unit design
• PO structure is composed of
  – Registers, to implement variables:
    • independent registers,
    • register arrays: ROM/RAM memories
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations.
  – Example: A + B → C
  – i.e., the contents of registers A and B are added, and the result is written into register C.
• Parallelism inside micro-operations.
  – Example: A + B → C, D – E → D
Example of parallel micro-operation
[Figure: the data path activated in the PO structure for the execution of J + 1 → J, C + 1 → C, with the corresponding values of the control variables.]
These register-transfer operations are logically independent, and they can actually be executed in parallel if proper PO resources are provided (e.g., two ALUs). Both operations are executed during the same clock cycle.
Approach to processing unit design
• PO structure is composed of
  – Registers, to implement variables:
    • independent registers,
    • register arrays: ROM/RAM memories
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations. Example: A + B → C
  – i.e., the contents of registers A and B are added, and the result is written into register C.
• Parallelism inside micro-operations. Example: A + B → C, D – E → F
• Conditional microinstructions: if … then … else, switch case …
• PC: at each clock cycle it interacts with PO in order to control the execution of the current microinstruction.
Example of microinstruction

i. if (Asign = 0)
     then A + B → A, J + 1 → J, goto j
     else A – B → B, goto k
j. …
…
k. …
Annotations: i is the current microinstruction label; (Asign = 0) is a logical condition on one or more condition variables (Asign denotes the most significant bit of register A); A + B → A, J + 1 → J is a (parallel) micro-operation; j and k are labels of the next microinstruction.
Control Part (PC) is an automaton, whose state graph corresponds to the structure of the microprogram:
[State graph with nodes i, j, k.]
Asign = 0: execute A + B → A, J + 1 → J in PO, and goto j
Asign = 1: execute A – B → B in PO, and goto k
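A hypothetical, clock-abstracted simulation of this microinstruction: PC evaluates the condition variable Asign and selects both the (parallel) micro-operation executed by PO and the next label. The 32-bit width and masking convention are assumptions of the sketch.

```python
# Simulation of microinstruction i:
#   if (Asign = 0) then A + B -> A, J + 1 -> J, goto j
#                  else A - B -> B, goto k

def microinstruction_i(state):
    """Execute microinstruction i on the register state; return next label."""
    A, B, J = state['A'], state['B'], state['J']
    Asign = (A >> 31) & 1                      # most significant bit of A
    if Asign == 0:
        # parallel micro-operation, one clock cycle: A + B -> A, J + 1 -> J
        state['A'] = (A + B) & 0xFFFFFFFF
        state['J'] = (J + 1) & 0xFFFFFFFF
        return 'j'
    else:
        # micro-operation: A - B -> B
        state['B'] = (A - B) & 0xFFFFFFFF
        return 'k'

state = {'A': 5, 'B': 3, 'J': 0}
print(microinstruction_i(state))   # j   (Asign = 0)
print(state)                       # {'A': 8, 'B': 3, 'J': 1}
```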
Approach to processing unit design
• PO structure is composed of
  – Registers, to implement variables: independent registers; register arrays: ROM/RAM memories
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations. Example: A + B → C, i.e., the contents of registers A and B are added, and the result is written into register C.
• Parallelism inside micro-operations. Example: A + B → C, D – E → F
• Conditional microinstructions: if … then … else, switch case …
• PC: at each clock cycle it interacts with PO in order to control the execution of the current microinstruction.
• Each microinstruction is executed in a clock cycle.
Generation of proper values of control variables.
Clock cycle: informal meaning
[Figure: timing of PC and PO across clock cycles; the pulse width allows the stabilization of register contents, and during each cycle PC generates the values of the control variables. Example: A + B → C (A, B, C: registers of PO) is executed as input_C = A + B, followed by input_C → C at the clock pulse; A + B → C, D + 1 → D are executed in parallel in the same clock cycle as input_C = A + B and input_D = D + 1, followed by C = input_C, D = input_D.]
Clock frequency f = 1/t; e.g. t = 0.25 nsec, f = 4 GHz.
Clock cycle t = maximum stabilization delay time for the execution of a (any) microinstruction, i.e. for the stabilization of the inputs of all the PC and PO registers.
Clock cycle: formally
[Figure: within a clock cycle, PO's output function (delay TωPO) produces the condition variables {x} for PC; PC's output function (TωPC) produces the control variables {g} for PO; the state transition functions (TσPO, TσPC) then stabilize the register inputs; d is the pulse width.]

clock cycle: t = TωPO + max (TωPC + TσPO, TσPC) + d

ωA: output function of automaton A
σA: state transition function of automaton A
TF: stabilization delay of the combinatorial network implementing function F
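The formula can be illustrated numerically; the delay values below (in nanoseconds) are invented for the example, chosen so that t matches the 0.25 nsec / 4 GHz figure used earlier.

```python
# Clock-cycle formula: t = TwPO + max(TwPC + TsPO, TsPC) + d (pulse width).
# All delay values are hypothetical, in nanoseconds.

def clock_cycle(T_wPO, T_wPC, T_sPO, T_sPC, d):
    return T_wPO + max(T_wPC + T_sPO, T_sPC) + d

t = clock_cycle(T_wPO=0.08, T_wPC=0.05, T_sPO=0.07, T_sPC=0.06, d=0.05)
print(round(t, 2))   # 0.25  ->  f = 1/t = 4 GHz
```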
Example of a simple unit design
Specification: IN: OUT = number_of_occurrences (IN, A)

Pseudo-code of interpreter:
  while (true) do
    B = IN; C = 0;
    for (J = 0; J < N; J++)
      if (A[J] == B) then C = C + 1;
    OUT = C

Microprogram (= interpreter implementation):
0. IN → B, 0 → C, 0 → J, goto 1
1. if (Jmost = 0) then zero (A[Jmodule] – B) → Z, goto 2
   else C → OUT, goto 0
2. if (Z = 0) then J + 1 → J, goto 1
   else J + 1 → J, C + 1 → C, goto 1
// to be completed with synchronized communications //

Service time of U: Tcalc = k t = (2 + 2N) t ≈ 2N t

Control Part state graph (automaton): states 0, 1, 2; per task, the arc 0 → 1 is taken 1 time, the arcs 1 → 2 (Jmost = 0) and 2 → 1 are taken N times each, and the arc 1 → 0 (Jmost = 1) is taken 1 time.

[Figure: unit U with a 32-bit input stream IN and a 32-bit output stream OUT; the integer array A[N] is implemented as a register array (RAM/ROM).]
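The cycle count k = 2 + 2N can be checked with a small simulation that charges one clock cycle per executed microinstruction; the test array is made-up.

```python
# Counting clock cycles of the microprogram above: microinstruction 0 runs
# once, microinstruction 1 runs N + 1 times, microinstruction 2 runs N times,
# hence k = 2 + 2N cycles per task.

def tcalc_cycles(A, b):
    cycles = 0
    B, C, J = b, 0, 0            # microinstruction 0: IN -> B, 0 -> C, 0 -> J
    cycles += 1
    N = len(A)
    while True:
        cycles += 1              # microinstruction 1
        if J < N:                # Jmost = 0
            Z = 1 if A[J] == B else 0   # zero(A[J] - B) -> Z
            cycles += 1          # microinstruction 2: J + 1 -> J (and C + 1 -> C)
            J += 1
            if Z:
                C += 1
        else:                    # Jmost = 1: C -> OUT, goto 0
            return C, cycles

A = [3, 7, 3, 1]                 # N = 4 (made-up data)
out, k = tcalc_cycles(A, 3)
print(out, k)                    # 2 10   (k = 2 + 2N)
```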
PO structure
[Figure: the PO data path: 32-bit registers B, C, J, OUT and the 1-bit register Z, each with a write-enable signal from PC; the register array A[N] (RAM/ROM, N words, 32 bits/word); the arithmetic-logic combinatorial circuits ALU1 (−, +1) and ALU2 (+1); multiplexers controlled from PC; condition variables x0 = J0 and x1 = Z sent to PC.]
PO structure
[Figure: the data path activated for the execution of J + 1 → J, C + 1 → C, with the corresponding values of the control variables.]
PO structure
[Figure: the data path activated for the execution of zero (A[J] – B) → Z, with the corresponding values of the control variables.]
Service time, and other performance parameters
• Computations on streams.
• Stream: (possibly infinite) sequence of values of the same type, also called tasks.
• Service time: average time interval between the beginning of processing of two consecutive tasks, i.e. between two consecutive services.
• Latency: average duration of the processing of a single task (from input acquisition to result generation): calculation time.
  – Service time is equal to latency for sequential computations only (a single module, or unit).
• Processing Bandwidth = 1 / service_time: average number of processed tasks per second (e.g. images per second, operations per second, instructions per second, bits transmitted per second, …), also called throughput for generic applications, or performance for CPUs (instructions or floating-point operations per second).
• Completion time, for a finite-length stream of m tasks: time interval needed to fully process all the tasks. For large m and parallel computations, the following approximate relation often holds:
    completion_time ≈ m · service_time
  The equality is rigorous for sequential computations.
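A back-of-the-envelope rendering of these parameters; the service time of 2 ns per task and the stream length m = 1000 are hypothetical.

```python
# Performance parameters on a stream: bandwidth = 1 / service_time, and the
# approximate relation completion_time ~ m * service_time.

service_time_ns = 2                  # ns between two consecutive services
bandwidth = 1e9 / service_time_ns    # tasks per second
m = 1000                             # finite stream length (hypothetical)
completion_ns = m * service_time_ns  # approximate completion time, in ns

print(bandwidth)        # 500000000.0  tasks/s
print(completion_ns)    # 2000         ns
```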
Communications at the firmware level
• Dedicated links (one-to-one, unidirectional).
• Shared link: bus (many-to-many, bidirectional). It can support one-to-one, one-to-many, and many-to-one communications.
• In general, for computer structures: parallel links (e.g. one or more words per link).
• Valid for: uniprocessor computers, shared memory multiprocessors, some multicomputers with distributed memory (MPP).
• Not valid for: distributed architectures with standard communication networks, some kinds of simple I/O.
Basic firmware structure for dedicated links
[Figure: Usource connected to Udestination through an output interface, a LINK, and an input interface.]

Usource::
  while (true) do
    < produce message MSG >;
    < send MSG to Udestination >

Udestination::
  while (true) do
    < receive MSG from Usource >;
    < process MSG >
Ready – acknowledgement protocol
[Figure: U1 with output register OUT and U2 with input register IN, connected by the MSG link (n bits), the RDY line (1 bit), and the ACK line (1 bit).]

< send MSG > ::
  wait ACK;
  write MSG into OUT, signal RDY, …

< receive MSG > ::
  wait RDY;
  utilize IN, signal ACK, …

Asynchronous communication: asynchrony degree = 1
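The protocol can be sketched as a sequential model with asynchrony degree 1: the sender may write OUT only when ACK is up, and the receiver may read IN only when RDY is up. The `Channel` class is an illustrative abstraction, not the firmware structure itself.

```python
# Ready-acknowledgement protocol with asynchrony degree 1: the sender can be
# at most one message ahead of the receiver. Sender and receiver are
# interleaved by hand to keep the sketch deterministic.

class Channel:
    def __init__(self):
        self.OUT = None
        self.RDY = 0
        self.ACK = 1            # initially the sender is allowed to write

    def send(self, msg):
        assert self.ACK == 1, "sender must wait for ACK"
        self.OUT, self.RDY, self.ACK = msg, 1, 0

    def receive(self):
        assert self.RDY == 1, "receiver must wait for RDY"
        msg = self.OUT
        self.RDY, self.ACK = 0, 1
        return msg

ch = Channel()
ch.send("MSG0")
print(ch.receive())     # MSG0
ch.send("MSG1")         # allowed again only after the ACK
print(ch.receive())     # MSG1
```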
Implementation
[Figure: U1's output register OUT connected to U2's input register IN through the MSG link (n bits); RDY and ACK lines (1 bit each) with interface registers RDYOUT, ACKOUT on the U1 side and RDYIN, ACKIN on the U2 side, written through the enables bRDYOUT, bACKOUT, bRDYIN, bACKIN.]

< send MSG > ::
  wait until ACKOUT;
  MSG → OUT, set RDYOUT, reset ACKOUT

< receive MSG > ::
  wait until RDYIN;
  f(IN, …) → …, set ACKIN, reset RDYIN

Level-transition interface: detects a level transition on the input line, provided that it occurs when S = Y.
[Figure: the RDY line is sampled into S; an exor of S and the modulo-2 counter Y, evaluated at each clock, produces RDYIN.]
set X: complements the content of X (a level transition is produced onto the output link);
reset X: forces the output of X to zero (by forcing element Y to become equal to S).
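An illustrative model (class and method names are assumptions) of the transition-based signalling: S tracks the line level, Y is the modulo-2 counter, and RDYIN = S xor Y detects that a transition has occurred and not yet been consumed.

```python
# Level-transition interface: the RDY line carries level transitions rather
# than levels; RDYIN = S xor Y is 1 exactly when a transition has arrived
# (set) and has not been consumed (reset) yet.

class TransitionInterface:
    def __init__(self):
        self.S = 0   # sampled line level
        self.Y = 0   # modulo-2 counter

    def rdyin(self):
        return self.S ^ self.Y

    def set_line(self):       # sender side: complement the line level
        self.S ^= 1

    def reset(self):          # receiver side: force RDYIN back to 0
        self.Y = self.S

i = TransitionInterface()
print(i.rdyin())   # 0: no transition yet
i.set_line()
print(i.rdyin())   # 1: transition detected
i.reset()
print(i.rdyin())   # 0: event consumed
```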
Example of a simple unit design with communication
Specification: IN: OUT = number_of_occurrences (IN, A)

Pseudo-code of interpreter:
  while (true) do
    receive (IN, B); C = 0;
    for (J = 0; J < N; J++)
      if (A[J] == B) then C = C + 1;
    send (OUT, C)

[Figure: unit U with 32-bit links IN and OUT, lines RDYIN, ACKIN, RDYOUT, ACKOUT, and the integer array A[N] implemented as a register array (RAM/ROM).]

Microprogram (= interpreter implementation):
0. if (RDYIN = 0) then nop, goto 0
   else reset RDYIN, set ACKIN, IN → B, 0 → C, 0 → J, goto 1
1. case (Jmost ACKOUT = 0 –) do zero (A[Jmodule] – B) → Z, goto 2
   case (Jmost ACKOUT = 1 0) do nop, goto 1
   case (Jmost ACKOUT = 1 1) do reset ACKOUT, set RDYOUT, C → OUT, goto 0
2. if (Z = 0) then J + 1 → J, goto 1
   else J + 1 → J, C + 1 → C, goto 1
Impact of communication protocol on performance parameters
Control Part state graph (automaton): states 0, 1, 2; self-loop on 0 when RDYIN = 0; arc 0 → 1 when RDYIN = 1 (1 time per task); arcs 1 → 2 (Jmost = 0) and 2 → 1 (N times each); self-loop on 1 when Jmost = 1 and ACKOUT = 0; arc 1 → 0 when Jmost = 1 and ACKOUT = 1 (1 time).

Synchronization implies wait conditions (wait RDYIN, wait ACKOUT), corresponding to arcs with the same source and destination node (stable internal states) in the state graph.

• In the evaluation of the internal calculation time of U, these wait conditions are not to be taken into account.
• That is, we are interested in evaluating the processing capabilities of U independently of the current situation of the partner units: as if the source and destination units were always ready to communicate.
• On the other hand, the communication protocol has "no cost", i.e. no overhead.
• Thus, the calculation time of U is again Tcalc = 2Nt.
• However, communication might have an impact on the service time.
Communication: performance parameters
• Because of communication impact, we distinguish between:
  – Ideal service time = internal calculation time
  – Effective service time, or simply service time
• Overhead of the RDY-ACK protocol at the firmware level, for dedicated links: null for the calculation time (ideal service time)!
  – All the required operations for the RDY-ACK protocol are executed in a single clock cycle, in parallel with the operations of the true computation.
• Communication latency: time needed to complete a communication, i.e. for the ACK reception.
• Service time: considering the effect of communication, it depends on the calculation time and on the communication latency:
  – once a send operation has been executed, the sender can execute a new send operation after the communication latency time interval (not before).
Communication latency and impact on service time
[Figure: sender and receiver timelines with RDY and ACK signals; clock cycles for calculation only alternate with clock cycles for calculation and communication; Ttr is the transmission latency (link only).]

Communication Latency: Lcom = 2 (Ttr + t)
Tcom = communication time NOT overlapped with (i.e. not masked by) internal calculation
Service time: T = Tcalc + Tcom
Tcalc ≥ Lcom ⇒ Tcom = 0
“Coarse grain” and “fine grain” computations
• Two possible situations: Tcalc ≥ Lcom or Tcalc < Lcom

Tcalc ≥ Lcom (Coarse Grain): Tcom = 0, service time T = Tcalc
Tcalc < Lcom (Fine Grain): Tcom > 0, service time T = Lcom

In general: T = Tcalc + Tcom = max (Tcalc, Lcom)
In the example: Tcalc = 2Nt. Let Ttr = 9t; then Lcom = 2 (t + Ttr) = 20t.

For N ≥ 10: Tcalc ≥ Lcom. In this case Tcom = 0 and the service time is T = Tcalc = 2Nt, i.e. a "coarse grain" calculation (wrt communications). However, for relatively low N, provided that N ≥ 10, even a rather "fine grain" computation (e.g. a few tens of clock cycles) is able to fully mask communications.

For N < 10: Tcalc < Lcom, Tcom > 0, and the service time is T = Lcom = 20t, i.e. a "too fine" grain calculation.
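Using the relation T = max(Tcalc, Lcom) with the example's numbers (Ttr = 9t, hence Lcom = 20t), expressed in clock cycles:

```python
# Service time with communication overlap: T = max(Tcalc, Lcom).
# Tcalc = 2N cycles, Lcom = 2(1 + Ttr) = 20 cycles with Ttr = 9 cycles.

def service_time_cycles(N, Ttr=9):
    Tcalc = 2 * N                 # internal calculation time, in cycles
    Lcom = 2 * (1 + Ttr)          # communication latency, in cycles
    return max(Tcalc, Lcom)

print(service_time_cycles(50))    # 100: coarse grain, T = Tcalc
print(service_time_cycles(10))    # 20:  boundary case, Tcalc = Lcom
print(service_time_cycles(4))     # 20:  too fine grain, T = Lcom
```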
Communication latency and impact on service time
Comm. Latency: Lcom = 2 (Ttr + t)
Service time: T = Tcalc + Tcom; Tcalc ≥ Lcom ⇒ Tcom = 0, efficiency e ≈ 1, ideal processing bandwidth (throughput).

• Impact of Ttr on the service time T is significant
  • e.g. memory access time = response time of "external" memory to processor requests ("external" = outside the CPU chip)
• Inside the same chip: Ttr ≈ 0,
  • e.g. between processor, MMU, cache
  • except for on-chip interconnection structures with "long wires" (e.g. buses, rings, meshes, …)
• With Ttr ≈ 0, very "fine grain" computations are possible without communication penalty on the service time: Lcom = 2t; if Tcalc ≥ 2t then T = Tcalc
• Important application in Instruction Level Parallelism processors and in Interconnection Network switch units.
Request-reply communication
Example: a simple cooperation between a processor P and a memory unit M (read or write request from P to M, reply from M to P).
[Figure: timeline of P and M: the request crosses the link (transmission latency Ttr), M processes it (memory clock cycle tM), and the reply crosses the link back.]

P-M request-reply latency: ta = 2 (t + Ttr) + tM. This is the memory access time as "seen" by the processor.
Integrated hardware technologies
• Integration on silicon semiconductor CHIPS.
• Chip area vs clock cycle:
  – Each logical HW component takes up a given area on silicon (currently: features of the order of nanometers).
  – Increasing the number of components per chip, i.e. increasing chip area, is positive provided that components are intensively "packed".
  – Relatively long links on chip are the main source of inefficiency.
  – Some macro-components are more suitable for intensive packing (notably: memories), others are very critical (e.g. large combinatorial networks realized by AND-OR gates).
  – Pin count problem: increasing the number of interfaces increases the chip perimeter linearly, thus the chip area increases quadratically. Many interfaces (e.g. over 10² – 10³ pins) are motivated only if the area increase is properly exploited by a quadratic increase in intensively packed HW components.
• Technology of CPUs vs technology of memories and links:
  – CPU clock cycle: t ≈ 10⁰ nsec (frequency: few GHz)
  – Static Main Memory clock cycle: tM ≈ 10¹ – 10² t
  – Dynamic Main Memory clock cycle: tM ≈ 10² – 10⁴ t
  – Inter-chip link transmission latency: Ttr ≈ 10⁰ – 10² t
  – Thus, ta is (one or) more orders of magnitude greater than the CPU clock cycle.
The processor-memory gap (Von Neumann bottleneck)
• For every (known) technology, the ratio ta/t increases at every technological improvement:
  – the frequency increase of the CPU clock is greater than the frequency increase of the memory clock and of the links (inverse of the unit latency of links).
• This is the central problem of computer architecture (yesterday, today, tomorrow).
  – SOLUTIONS ARE BASED ON MEMORY HIERARCHIES AND ON PARALLELISM.
• "Moore's law" (the CPU frequency doubles every 1.5 years) was valid until 2004 (we'll see why). Memory and link technologies have "stopped" too.
  – Multi-core evolution / revolution.
  – Moore's law is now applied to the number of cores (CPUs) in multi-core and many-core chips.
  – What about programming? One main motivation of the SPA and SPM courses.
Interconnection structures based on dedicated links: first examples
[Figure: a tree interconnection: E units distribute the input stream down to units U1 … U4, and C units collect their results into the output stream.]

Trees: interconnection of a single input stream and of a single output stream to a collection of n independent units.
Communication Latency = O(lg n)

Rings / Tori: Communication Latency = O(n)
[Figure: a ring of units U1, U2, U3, U4.]
Interconnection structures based on dedicated links: first examples

Meshes (possibly toroidal):
[Figure: a 3 × 3 mesh of units U.]
Communication Latency = O(n^1/2)
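The order-of-magnitude latencies of the three structures can be compared numerically (constants omitted; n is the number of units):

```python
# Communication latency orders for the interconnection structures above:
# trees O(lg n), rings/tori O(n), meshes O(n^1/2). Constants are omitted.
import math

def tree_latency(n):  return math.ceil(math.log2(n))   # O(lg n)
def ring_latency(n):  return n                         # O(n)
def mesh_latency(n):  return math.isqrt(n)             # O(n^1/2)

n = 64
print(tree_latency(n), ring_latency(n), mesh_latency(n))   # 6 64 8
```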
Interconnection structures based on dedicated links

• In general, all the most interesting interconnection structures for
  – shared memory multiprocessors
  – distributed memory multicomputers, clusters, MPP
  are "limited degree" networks
  – based on dedicated links
  – with a limited number of links per unit: PIN COUNT problem
• Examples (Course Part 2):
  – n-dimension cubes
  – n-dimension butterflies
  – n-dimension Trees / Fat Trees
  – rings and multi-rings
• Currently the "best" commercial networks belong to these classes:
  – Infiniband
  – Myrinet
  – …
• Trend: on-chip optical interconnection networks
(Traditional, old-style) Bus
[Figure: sender behaviour and receiver behaviour on the bus.]
Shared, bidirectional link. At most one message at a time can be transmitted.
Arbitration mechanisms must be provided.
Serious problems of bandwidth (contention) for bus-based parallel structures.
Arbiter mechanisms (for bus writing)
• Independent Request arbiters
• Daisy Chaining
• Polling (similar)
Decentralized arbiters
• Token ring
• Non-deterministic with collision detection
Firmware: summary
[Figure: the level stack: Applications, Processes, Assembler, Firmware, Hardware. At the Firmware level: Uniprocessor (Instruction Level Parallelism), Shared Memory Multiprocessor (SMP, NUMA, …), Distributed Memory (Cluster, MPP, …).]

• The true architectural level.
• Microcoded interpretation of assembler instructions.
• This level is fundamental (it cannot be eliminated).
• Physical resources: combinatorial and sequential circuit components, physical links.
• Decentralized control of the firmware interpreter
• Synchronous computation model
• Clock cycle
• Parallelism inside microinstructions
• Clear cost model for calculation and for communication
• Efficient communications can be implemented on dedicated links
• Communication-internal calculation overlapping