Shared Memory MIMD Architectures
Sima, Fountain and Kacsuk, Chapter 18
CSE462
David Abramson, 2004. Material from Sima, Fountain and Kacsuk, Addison Wesley, 1997.
Design choices
Types of shared memory
 – Physically shared memory
 – Virtual (or distributed) shared memory
Scalability issues
 – Organisation of memory
 – Design of interconnection network
 – Cache coherence protocols
Design space of shared memory computers
Shared memory computers - design dimensions:
 – Single address space memory access
   • Physical shared memory: UMA
   • Virtual shared memory: NUMA, CC-NUMA, COMA
 – Interconnection scheme
   • Shared path: single bus based; multiple bus based (bus multiplication, grid of buses, hierarchical system)
   • Switching network: crossbar; multistage network (Omega, Banyan, Benes)
 – Cache coherency
   • Hardware based
   • Software based
Classification of dynamic interconnection networks
Dynamic interconnection networks enable a temporary connection between any two components of a multiprocessor
 – Shared path networks: single bus, multiple buses
 – Switching networks: crossbar, multistage networks
Buses
Very limited scalability
 – Typically 3-5 processors unless special techniques (TDM) are used
 – Can be expanded significantly if we
   • Use private memory
   • Use coherent cache memory
   • Use multiple buses
Structure of a single bus multiprocessor (no caches)
[Figure: processors P1..Pk, memory units M1..Mn and I/O units I/O1..I/Om share a single bus (address, data, control, interrupt and bus exchange lines), managed by bus arbiter and control logic.]
Locking or multiplexing the bus
Two main approaches:
Locking and holding
 – Acquire the bus
 – Send out address and/or data
 – Wait for the data (read), or wait for the write to complete
 – Release the bus
Multiplexing
 – Acquire a bus time slot
 – Send address and/or data
 – Come back for the data n cycles later (read), or keep going if it is a write
Memory write on locked bus
[Timing diagram: each of P1-P4 holds the bus for its full bus cycle plus memory cycle, so the four writes complete serially at times 4, 8, 12 and 16.]
Memory write on multiplexed buses
[Timing diagram: the bus cycle of one processor overlaps with the memory cycle of another, so the four writes by P1-P4 complete by about time 7. Note: this assumes the writes go to different memory banks.]
Memory read on locked bus
[Timing diagram: each read occupies the bus for three phases - phase 1: the address bus is used; phase 2: the bus is idle while the memory cycle completes; phase 3: the data bus is used. Reads by P1-P3 are fully serialized, finishing around times 10, 20 and 30.]
Memory read on multiplexed bus
[Timing diagram: on a multiplexed bus the idle phase 2 of one read overlaps with the address (phase 1) and data (phase 3) phases of the others, so P1-P3 finish by about time 12.]
Memory read on split-transaction bus
[Timing diagram: the address and data phases of different reads are interleaved, so the next transfer is started before the last one has completed.]
Next transfer started before the last one completed!
Needs special associative hardware to match returning data with the request that issued it
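A split-transaction bus has to remember which master issued each outstanding request so the later data phase can be matched up again. The sketch below is purely illustrative (the tag width, table size and field names are assumptions, not taken from any particular bus).

  #include <stdbool.h>
  #include <stdint.h>

  /* One entry per outstanding transaction on the split-transaction bus. */
  typedef struct {
      bool     valid;
      uint32_t addr;     /* address issued in the request phase       */
      uint8_t  master;   /* which bus master is waiting for the reply */
  } pending_t;

  #define MAX_PENDING 8
  static pending_t pending[MAX_PENDING];

  /* Request phase: remember who asked for what and return a tag that
     travels with the reply; -1 means the table is full and the request
     must be retried. */
  int issue_request(uint32_t addr, uint8_t master) {
      for (int tag = 0; tag < MAX_PENDING; tag++) {
          if (!pending[tag].valid) {
              pending[tag] = (pending_t){ true, addr, master };
              return tag;
          }
      }
      return -1;
  }

  /* Reply phase: the associative lookup the slide alludes to - the tag
     identifies the waiting master, and the entry is freed. */
  int complete_request(int tag) {
      int master = pending[tag].master;
      pending[tag].valid = false;
      return master;
  }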
Arbiter Logic
Because the bus is a shared resource, masters must arbitrate for access
The arbiter may be
 – Centralised
   • A central unit looks at all requests
 – Decentralised
   • Logic is split amongst the bus masters
   • Scalable - each new master adds more logic
Design space of arbiter logic
Arbiter logic - design dimensions:
 – Organization: centralized, distributed
 – Bus allocation policy: fixed priority, rotating, round robin, least recently used, first come first served
 – Handling of requests: fixed priority, rotating
 – Handling of grants: fixed priority, rotating
Centralized arbitration with independent requests and grants
[Figure: a central bus arbiter is connected to masters 1..N by separate request (R1..RN) and grant (G1..GN) lines; all masters share the bus lines and a bus-busy line.]
Arbitration sequence:
 1. Masters request the bus on their request lines
 2. The arbiter grants the bus to one of them
 3. The successful master claims the bus (asserts bus busy)
 4. When the transfer is finished, the bus is released
Daisy-chained bus arbitration scheme
[Figure: the central bus arbiter drives a single grant line that is daisy-chained through masters 1..N (Bus Grant 1, G2, ..., GN); all masters share a bus-request line and a bus-busy line.]
Arbitration sequence:
 1. Masters request the bus on the shared request line
 2. The arbiter generates a bus grant
 3. The grant propagates along the chain; a requesting master keeps it and does not propagate it further
 4. That master claims the bus (asserts bus busy)
 5. When the transfer is finished, the bus is released
Decentralized rotating arbiter with independent requests and grants
Problems with the previous designs
 – Lack of fairness
 – Waiting while the grant signal propagates
Rotating priority solves the lack of fairness
 – The logically first master is not the same as the physically first one
[Figure: masters 1..N each have their own arbiter (Arbiter 1..N) with request (R) and grant (G) lines; a priority signal (P1..PN) rotates among the arbiters, which share the bus-busy line. R: Request, G: Grant, P: Priority]
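As a rough illustration of the rotating-priority idea (a software sketch, not the distributed arbiter hardware): the search for a grant starts one position past the last winner, so every master eventually reaches the highest priority.

  #define NMASTERS 8

  /* request[i] is nonzero when master i wants the bus; last_winner is the
     master granted in the previous round. Returns the granted master, or
     -1 if nobody is requesting. */
  int rotating_arbiter(const int request[NMASTERS], int last_winner) {
      for (int offset = 1; offset <= NMASTERS; offset++) {
          int candidate = (last_winner + offset) % NMASTERS;
          if (request[candidate])
              return candidate;   /* highest priority sits just after the last winner */
      }
      return -1;
  }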
Multiple buses
The single bus is the limiting factor
Increase bandwidth by adding additional resources (buses)
1-dimension multiple bus multiprocessor
 – Each processor is connected to all buses
 – Each memory is connected to all buses
 – A processor chooses a bus dynamically
 – Load can be spread across the buses
[Figure: processors P1..Pn and memories M1..Mm all attach to buses B1..Bb.]
2 and 3 dimensional bus system
[Figure: processor-memory (PM) nodes connected by buses in two and three dimensions.]
2 Dimensional bus design
Can support specialised access patterns, e.g. a climate model
 – Access to local data
 – Access to data at the same latitude
 – Access to data at the same longitude
Cluster bus architecture
 – Hierarchy of buses
 – Arbitrarily large networks
 – Cache coherence becomes very difficult
[Figure: clusters 1..8, each an Encore Multimax with uniform interconnection cards and a uniform cluster cache on its cluster bus (Nanobus), joined by a global bus (Nanobus).]
Switching Networks
Multistage networks - design dimensions:
 – No. of stages
 – No. of switches at a stage
 – Topology of links among stages
 – Switch type
   • No. of input and output links
   • Operation mode: normal switch, queuing switch, combining switch
 – Operation mode: blocking, non-blocking
View of a crossbar network
The crossbar allows any processor to connect with any memory
As long as there is no contention for the same memory, the network is non-blocking
[Figure: an n x n grid of switches (S) connects processors P1..Pn to memory units M1..Mn; closing the switch at the crossing point of a processor row and a memory column establishes the connection.]
Detailed structure of a crossbar network
[Figure: each crosspoint of the crossbar contains a switch, an arbiter and a BBCU; the control, address and data buses of processors P1..Pn are switched onto the control, address and data buses of memory unit Mi.]
Multi-stage interconnection networks
Cannot directly connect processors to memory
Use crossbar switches as components to build a larger network
Minimum number of stages is logarithmic
 – Single path
 – No fault tolerance
 – Blocking (if an intermediate switch is in use)
Omega network topology
 – 2 x 2 crossbar switch components
 – Butterfly: built from 8x8 switches
 – Unique path from one port to another (see the routing sketch below)
 – Log depth
[Figure: an 8x8 omega network; input and output ports are numbered 000-111, and the 2x2 switch settings shown include straight through, upper broadcast and lower broadcast.]
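A small sketch of the destination-tag routing implied by the unique-path property (illustrative code, not from the slides; an 8x8 network is assumed): before each stage the lines are perfect-shuffled, then the destination bit for that stage, most significant first, selects the upper (0) or lower (1) switch output.

  #include <stdio.h>

  #define STAGES 3
  #define N (1 << STAGES)

  static unsigned shuffle(unsigned x) {
      /* perfect shuffle = rotate the STAGES address bits left by one */
      return ((x << 1) | (x >> (STAGES - 1))) & (N - 1);
  }

  static void route(unsigned src, unsigned dst) {
      unsigned line = src;
      printf("%u -> %u:", src, dst);
      for (int s = 0; s < STAGES; s++) {
          line = shuffle(line);                      /* link permutation */
          unsigned bit = (dst >> (STAGES - 1 - s)) & 1u;
          line = (line & ~1u) | bit;                 /* switch output    */
          printf("  stage %d: line %u", s, line);
      }
      printf("\n");                                  /* line == dst here */
  }

  int main(void) {
      /* the two requests from the blocking slide further on: both need
         the same middle-stage output line, so one of them is blocked */
      route(0, 5);
      route(6, 4);
      return 0;
  }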
Omega network topology
Some configurations are non-blocking, e.g. the reversal permutation:
0->7, 1->6, 2->5, 3->4, 4->3, 5->2, 6->1, 7->0
[Figure: the omega network routing this permutation with no switch conflicts.]
Broadcast in the omega network
[Figure: a broadcast from one input port reaching all eight output ports 000-111 using the broadcast switch settings.]
Blocking in an omega network
[Figure: routing 0->5 and 6->4 at the same time; the two paths need the same switch output in the middle stage, so one request is blocked.]
(0->5, . . ., 6->4, . . .)
Multistage Network Properties

Network Type      | # Stages        | Switches/stage | Topology                      | Switch Size | Op Mode
Omega             | log2N           | N/2            | 2-way shuffle                 | 2x2         | Blocking
Butterfly         | log8N           | N/8            | 8-way shuffle                 | 8x8         | Blocking
Generalized-cube  | S = log2N       | N/2            | [0,1] shuffle, [1,S] exchange | 2x2         | Blocking
Benes             | S = 2log2N - 1  | N/2            | [2,S] exchange                | 2x2         | Non-blocking
Hot-spot saturation in a blocking omega network
[Figure: an 8x8 blocking omega network connecting P0-P7 to M0-M7.]
P2->M4 active => P7->M4 blocked => P1->M5 blocked => P5->M7 blocked
Hotspots in Omega networks
In a shared memory machine there are two sorts of contention
 – Memory unit
 – Switch elements
Certain access patterns can repeatedly block each other even though they address different memory units
Message combining can solve these problems
 – The switch element buffers the request
 – Memory only sees one request
[Figure: three 'Read 100' requests are combined in the network so that memory sees a single read of location 100.]
Structure of a combining switch
Introduced on NYU Ultracomputer
[Figure: a 2x2 combining switch between Proc(i)/Proc(j) and Mem(k)/Mem(l); each output has a combining queue and a non-combining queue, and wait buffers hold the information needed to split combined replies on the way back.]
Cache Coherence
Cache coherence problems are caused by
 – Sharing of writable data
 – Process migration
 – I/O activity
[Figures: three processor/cache/memory examples - one processor writes 100,5 while others read location 100 or write 100,100, leaving cached copies inconsistent; a migrating process finds a stale copy in the cache of the processor it moves to; an I/O device reads location 100 straight from memory while a newer value is still held in a cache.]
Classification of data structures
Read only
 – Never causes cache coherence problems
Shared writable
 – Main source of cache coherence problems
Private writable data
 – Causes problems with process migration
Solutions
 – Hardware-based protocols
 – Software-based protocols
Design space of hardware-based cache coherence protocols
Hardware-based cache coherence protocols - design dimensions:
 – Memory update policy: write-through, write-back
 – Cache coherence policy: write-invalidate, write-update
 – Interconnection scheme (continued on the next slide)
Design space of hardware-based cache coherence protocols (cont.)
Interconnection scheme:
 – Single bus: snoopy cache protocols
 – Multistage network: directory schemes (centralized or distributed; full-map, limited or chained directories)
 – Multiple bus: hierarchical cache coherence protocols
Write-through memory update policy
 – Memory is always updated on a write
 – Intuitively easier to keep caches coherent
[Figure: Pi executes 'Store D1'; with write-through the new value D1 is written to Pi's cache and immediately to main memory.]
Write-back memory update policy
 – Data is only written back to memory when the block is flushed
 – The processor can do many writes before the block is flushed
[Figure: Pi executes 'Store D1'; the new value D1 stays in Pi's cache while main memory still holds the old value D until the block is flushed.]
Write-update cache coherence policy
When a processor writes a variable, it updates all copies held in the other processors' caches
[Figure: Pk executes 'Store D1' and broadcasts Update(D1); the copies in Pi's and Pj's caches are updated to D1.]
Write-invalidate cache coherence policy
When a processor writes a variable, it invalidates the copies in all other caches
Makes the writing processor the "owner"
[Figure: Pk executes 'Store D1' and broadcasts Invalidate(addr(D)); the copies in the other caches become invalid.]
Snoopy Protocols
If the interconnection network supports broadcasting (cheaply), a snoopy policy is effective
 – Every cache "watches" every transaction to memory
 – Works for buses
If broadcast is not efficient
 – Use a directory-based scheme
 – The directory keeps track of where cache blocks are located
Snoopy write update protocol
Possible cache block states
 – Used to support the cache coherence protocol
 – Valid-exclusive
   • The only copy of this cache block; cache and memory are consistent
 – Shared
   • Several copies of this cache block exist
 – Dirty
   • The only copy, but cache and memory are inconsistent
Read Miss logic
The snoopy cache controller broadcasts a Read-Blk command on the bus
 – If there are shared copies
   • The block is delivered by a cache holding a copy
 – If there is a dirty copy
   • It is supplied and flushed to main memory; all copies become shared
 – If a valid-exclusive copy exists
   • The copy is supplied and all copies become shared
 – If there is no cached copy
   • Memory supplies the data; the block becomes valid-exclusive
Snoopy Update - Read miss
[Figure: Pi misses on 'Load D' and broadcasts Read-blk(addr(D)); the block is supplied and the cache states (valid-exclusive, shared, dirty) change as described by the read-miss logic above.]
Write hit logic
If the block is valid-exclusive or dirty
 – The write is performed locally
 – The new state is dirty
If the block is shared
 – Broadcast an update-block command on the bus
 – All copies (including memory) are updated
 – The state remains shared
Snoopy Update – Write hit Exclusive
[Figure: Pi holds D valid-exclusive and executes 'Write D'; the write completes locally with no bus traffic and the block becomes dirty.]
Snoopy Update – Write hit Shared
[Figure: Pi holds D shared and executes 'Write D'; a Write(addr(D)) update is broadcast on the bus, all cached copies and memory are updated, and the states remain shared.]
Write miss
If only memory contains a copy
 – Memory is updated
 – The requesting cache is loaded with the data - valid-exclusive
If shared copies are available
 – All copies (including the memory copy) are updated
 – The requesting cache is loaded with the data - shared
If a dirty or valid-exclusive copy exists
 – The other cached copy is updated
 – Memory is updated
 – The requesting cache is loaded with the data - shared
Snoopy Update – Write miss
[Figure: Pi misses on 'Write D' and broadcasts Write(addr(D)); the copies and memory are updated according to the write-miss logic above.]
State transition graph for snoopy update
The cache responds to
 – P-Read and P-Write commands from the processor, and
 – Read-Blk, Write-Blk and Update-Blk commands from the bus
[Figure: state transition graph over the states valid-exclusive, shared and dirty, with edges labelled by the processor and bus commands above.]
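Read as a transition function per event, the graph can be sketched in a few lines. The fragment below encodes two cases from the preceding slides for the write-update protocol (a processor write hit and a snooped Read-Blk); it is one plausible encoding, not the literal controller logic.

  #include <stdbool.h>

  typedef enum { INVALID, VALID_EXCLUSIVE, SHARED, DIRTY } blk_state;

  /* P-Write hit: valid-exclusive or dirty blocks are written locally and
     become dirty; shared blocks broadcast an Update-Blk and stay shared. */
  blk_state on_p_write_hit(blk_state s, bool *broadcast_update) {
      switch (s) {
      case VALID_EXCLUSIVE:
      case DIRTY:
          *broadcast_update = false;
          return DIRTY;
      case SHARED:
          *broadcast_update = true;      /* memory is updated as well */
          return SHARED;
      default:
          return s;                      /* misses handled elsewhere  */
      }
  }

  /* Snooped Read-Blk from another cache: a dirty copy is flushed to
     memory, and any copy we hold becomes shared. */
  blk_state on_bus_read_blk(blk_state s, bool *flush_to_memory) {
      *flush_to_memory = (s == DIRTY);
      return (s == INVALID) ? INVALID : SHARED;
  }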
Structure of the snoopy cache controller
Snoopy controller needs to operate at bus speed
[Figure: each processing element PEi..PEn contains the processor, its cache, and a snoopy cache controller made of two parts - a cache controller on the processor side and a snoopy controller on the bus side - which share the cache directory; the snoopy controller watches the address (A) and data (D) lines of the memory bus.]
Directory Schemes
Directory schemes send consistency commands only to those caches that hold a valid copy of the shared block
Designed for systems where snooping is not possible
Three main approaches (a directory-entry sketch follows the list):
 – Full-map directory
   • Each entry points to all caches
   • The entry indicates whether the block is present in each remote cache
   • Not efficient for large systems
 – Limited directory
   • Points to only a subset of the caches
   • Works because a variable tends not to be shared by all processors
   • Otherwise the same information as in the full map
 – Chained directory
   • Directory entries form a linked list
   • Scalable - processors can be added without increasing the directory width
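The practical difference between the full-map and limited schemes is the shape of one directory entry, sketched below (sizes and field names are assumptions made for illustration, not the slides' hardware).

  #include <stdint.h>

  #define NCACHES   64   /* assumed machine size                     */
  #define NPOINTERS  4   /* assumed pointer count in a limited entry */

  /* Full map: one presence bit per cache, so the entry width grows
     with the number of caches in the machine. */
  typedef struct {
      uint64_t present;      /* bit i set => cache i holds a copy    */
      uint8_t  dirty;
  } fullmap_entry;

  /* Limited: a fixed, small number of cache-ID pointers, independent
     of the machine size. */
  typedef struct {
      uint8_t owner[NPOINTERS];
      uint8_t count;
      uint8_t dirty;
  } limited_entry;

  /* Consistency commands go only to the recorded sharers. */
  void invalidate_sharers(const fullmap_entry *e, void (*send_inv)(int)) {
      for (int i = 0; i < NCACHES; i++)
          if (e->present & (1ULL << i))
              send_inv(i);
  }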
Chained directory scheme
[Figures: the directory entry for block X in shared memory points to the cache at the head of a chain; each cache entry holds X plus a pointer to the next cache, with the last entry marked CT (chain terminator). When another processor reads X, its cache is linked in and the directory entry is updated to point to the new head.]
Scalable Coherence Interface
A concrete example of a chained directory
An IEEE standard that defines
 – The interface to the interconnection network
 – Not any particular interconnection network
The interface is point-to-point
 – Well suited to networks like the Convex Exemplar's simple, unidirectional ring
Designed for building scalable shared memory machines
Structure of sharing-lists in the SCI
Operations are defined for
 – Creation
 – Insertion
 – Deletion
 – Reduction to a single node
[Figure: memory holds the block data (64 bits) plus a memory state (mstate) and a forward pointer (forw_id) to the head of the sharing list; each node's cache entry holds the data plus a cache state (cstate), a memory ID (mem_id), and forward (forw_id) and backward (back_id) pointers, forming a doubly linked list over nodes i, j, k.]
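The sharing list is essentially a doubly linked list threaded through the caches, with the memory directory holding the head pointer. A minimal sketch of the entry layout and of the head insertion shown on the next slide (illustrative only; the real SCI field encodings differ):

  #include <stdint.h>

  #define NNODES 64
  #define NIL    0xFFFF

  typedef struct {              /* per-block state at the memory/home */
      uint16_t forw_id;         /* node ID of the list head, or NIL   */
      uint8_t  mstate;
  } mem_entry_t;

  typedef struct {              /* per-block state in each cache      */
      uint16_t forw_id;         /* next node towards the tail, or NIL */
      uint16_t back_id;         /* previous node; NIL marks the head  */
      uint8_t  cstate;
  } cache_entry_t;

  static cache_entry_t cache_entry[NNODES];

  /* A new reader prepends itself: it becomes the head of the sharing
     list and the old head (if any) becomes its forward neighbour. */
  void sci_prepend(mem_entry_t *m, uint16_t me) {
      cache_entry[me].forw_id = m->forw_id;
      cache_entry[me].back_id = NIL;
      if (m->forw_id != NIL)
          cache_entry[m->forw_id].back_id = me;
      m->forw_id = me;          /* memory now points at the new head  */
  }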
Insertion in a sharing-list
[Figure: the new node sends a 'prepend' request to memory and receives a 'new-head' response; it then links itself in front of the previous head of the sharing list.]
Messages for deletion
[Figure: node j deletes itself by sending two 'update forward' messages (1, 2) so that the pointers of its neighbours bypass it.]
Structure of the sharing-list after deletion
[Figure: after the deletion the sharing list links memory, node i and node k only.]
Hierarchical cache coherence
[Figure: a two-level hierarchy - processors P0, P1 with first-level caches C10, C11 on local buses B10, B11, B12, and second-level caches C20, C21, C22 on a global bus B20 below main memory; a write to X in one cluster propagates upwards and invalidations are sent down to the other clusters that hold X.]
Software Based Coherence
Software approaches rely on compiler assistance
Identify different classes of variables:
 – Read-only
 – Read-only for any number of processors and read-write for one process
 – Read-write for one process
 – Read-write for any number of processes
Once identified (by static analysis), each class is handled differently
Software based cache coherence
Read-only variables
 – Can be cached at any time
Read-only for any number and read-write for one process
 – Can only be cached on the writing processor
Read-write for one process
 – Cached only on that processor
Read-write for many processes
 – Cannot be cached at all
Clearly, accurate information is needed in order to limit the performance hit
Classification of software-based cache coherence protocols
Software-based protocols:
 – Indiscriminate invalidation
 – Selective invalidation
   • Parallel for-loop based
   • Critical section based
   • Fast selective invalidation
   • Version control scheme
   • Timestamp scheme
Invalidation
Can invalidate the entire cache
 – A single hardware mechanism for clearing the valid bits
 – Very conservative!
Selective invalidation
 – Invalidate before critical sections
 – Understand parallel for-loops and invalidate accordingly
 – Still needs hardware support to clear entries effectively
[Figure: a cache of key/data entries, each with a valid (v) bit.]
Using knowledge of critical regions
  :
  :
  Secure_lock()
  Invalidate_cache()
  :
  :        <- variables in here can be used without worrying about any other processes
  Flush_cache()
  Release_lock()
  :
  :
Using knowledge of parallel loops
Original program:
  :
  Par For (i = 0; i < 100; i++) {
    :
  }
  :
Knowledge about the loop lets it be split between processors:
Processor 0:
  :
  Par For (i = 0; i < 50; i++) {
    :
  }
Processor 1:
  :
  Par For (i = 50; i < 100; i++) {
    :
  }
Selective invalidation schemes
Add a change bit to the cache block status
 • The change bit is set to true when the block changes
 – If the block is then read, it is invalidated and reloaded
Add a timestamp to the cache block
 – A clock is associated with each data structure
 – The timestamp in the cache is updated when the block changes
 – The timestamp in the block can be compared with the current timestamp
Adding a version number
 – Similar to the clock scheme (a sketch of the timestamp idea follows)
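A sketch of the timestamp idea in software terms (the names and the exact comparison are invented for illustration; the published schemes differ in detail):

  #include <stdbool.h>
  #include <stdint.h>

  typedef struct {
      uint32_t filled_at;    /* timestamp recorded when the block was cached */
      bool     valid;
  } cached_block;

  typedef struct {
      uint32_t clock;        /* clock associated with the data structure     */
  } shared_structure;

  /* A cached copy is trusted only if it was filled at or after the
     structure's current clock value. */
  bool usable(const cached_block *b, const shared_structure *s) {
      return b->valid && b->filled_at >= s->clock;
  }

  /* Advance the clock after a phase (e.g. a parallel loop) in which other
     processors may have written the structure; stale copies then fail the
     comparison and are reloaded instead of being explicitly invalidated. */
  void end_of_phase(shared_structure *s) {
      s->clock++;
  }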
Synchronization & Event Ordering
Mutual exclusion is required in many parallel algorithms
 – Monitors
 – Semaphores
All high-level schemes are based on low-level synchronization tools
Atomic test-and-set is common in shared memory multiprocessors
 – Needs to take account of the cache
   • Minimum traffic generated while waiting
   • Low latency release of a waiting processor
   • Low latency acquisition of a free lock
 – Typically works well on small bus-based machines
Synchronization with test-and-set
The lock variable has two states
 – OPEN
 – CLOSED
Acquire lock:
  char *lock;
  while (exchange(lock, CLOSED) == CLOSED);
Release lock:
  *lock = OPEN;
Cache states after Pi successfully executed test&set on lock
[Figure: after Pi's exchange on the lock, the lock block is dirty in Ci and the copies in the other caches are invalid.]
Bus commands when Pj executes test&set on lock and cache states after
[Figure: Pj executes test&set - (1) Read-Blk(lock) on the bus, (2) the block is supplied to Cj, (3) Invalidate(lock) is broadcast; the lock (closed) is now dirty in Cj and invalid in Ci.]
Cache states after Pk executed test&set on lock
[Figure: after Pk's test&set the lock (closed) is dirty in Ck and the copies in the other caches are invalid.]
Busy waiting with cache coherence
An indivisible test-and-set instruction requires write access to the lock
 – This causes the processor doing the test-and-set to acquire the variable in its cache, invalidating all other copies
When multiple processors spin on the lock
 – Each one tries to acquire the variable in its cache
 – This causes cache thrashing
Instead, use a snooping lock
 – Spin on a plain test without the indivisible test-and-set
 – Only attempt the exchange once the lock is seen OPEN
Efficient algorithm for locking
  while (exchange(lock, CLOSED) == CLOSED)
      while (*lock == CLOSED);
The first while claims and closes the lock if it is OPEN
If it is already CLOSED, control transfers to the second loop, which
 – Continuously reads the lock (from the local cache)
 – Generates no bus traffic during this phase
 – When the lock becomes OPEN, tries the test-and-set again
Test and test-and-set
It is even more efficient to test the lock before trying to set it:
  for (;;) {
      while (*lock == CLOSED);
      if (exchange(lock, CLOSED) != CLOSED)
          break;
  }
Introduces extra latency for unused locks
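For reference, the same test-and-test-and-set loop written as runnable C11, with atomic_exchange standing in for the exchange() primitive used on the slides (a sketch, not the original code):

  #include <stdatomic.h>
  #include <stdbool.h>

  #define OPEN   false
  #define CLOSED true

  typedef atomic_bool spinlock_t;

  void lock_acquire(spinlock_t *lock) {
      for (;;) {
          /* spin in the local cache; no bus traffic while CLOSED */
          while (atomic_load_explicit(lock, memory_order_relaxed) == CLOSED)
              ;
          /* lock looks OPEN: try the atomic exchange */
          if (atomic_exchange_explicit(lock, CLOSED,
                                       memory_order_acquire) != CLOSED)
              return;   /* we closed it ourselves - lock held */
      }
  }

  void lock_release(spinlock_t *lock) {
      atomic_store_explicit(lock, OPEN, memory_order_release);
  }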
Lock implementation on scalable multiprocessors
The NYU Ultracomputer and IBM RP3 implemented fetch-and-add
Fetch-and-add
 – An atomic operation
 – All memory modules are augmented with an adder circuit

  int fetch_and_add(int *x, int a)
  {
      int temp = *x;    /* performed atomically at the memory module */
      *x = *x + a;
      return temp;
  }
Example of fetch-and-add
Suppose we want to implement the parallel loop
  DOALL N = 1 to 1000
    <loop body using N>
  ENDDO
and we want to allocate iterations to processors dynamically:
  N = 1;
  i = fetch_and_add(&N, 1);
  while (i <= 1000) {
      loop_body(i);
      i = fetch_and_add(&N, 1);
  }
Regardless of how many processors execute the loop
 – Each processor will get a different value of i
Fetch-and-add
Fetch-and-add automatically allocates loop indexes in this example
But location N becomes a hotspot
The combining network described before will not work correctly without modification
 – A plain read-combining switch would return the same value to every requester
Change each switch element so that it can itself implement the fetch-and-add operation
 – Distributed operation without hotspots (see the sketch below)
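The essential trick in a combining switch can be sketched in a few lines (illustrative only; the field names and the two-input restriction are assumptions): on the way in, two fetch-and-adds to the same location merge into one, and the first increment is parked in the wait buffer so the two replies can be reconstructed on the way out, as in the forward/back propagation figures that follow.

  typedef struct { int addr; int increment; } faa_req_t;

  /* Forward path: two requests for the same address leave the switch as
     one request; the wait buffer remembers the first increment so the
     reply can be decombined later. */
  faa_req_t combine(faa_req_t a, faa_req_t b, int *wait_buffer) {
      faa_req_t out = { a.addr, a.increment + b.increment };
      *wait_buffer = a.increment;     /* needed to reconstruct b's reply */
      return out;
  }

  /* Backward path: memory returned 'old' (the value before the combined
     add); request a gets 'old', request b gets 'old + a.increment'. */
  void decombine(int old, int wait_buffer, int *reply_a, int *reply_b) {
      *reply_a = old;
      *reply_b = old + wait_buffer;
  }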
Forward propagation of fetch-and-add
[Figure: eight F&A(N,1) requests are combined pairwise into F&A(N,2), then F&A(N,4), and reach memory module M6 as a single F&A(N,8); M6 returns the old value N=1 and N becomes 9.]
Back propagation of fetch-and-add
[Figure: on the way back each switch uses its wait buffer to decombine the reply, so the eight processors receive the distinct values 1 to 8.]
A quick tour of some UMA machines
Single bus multiprocessors - design dimensions with example machines:
 – Bus working mode: locked bus; pending bus (Multimax); split-transaction bus (Power Challenge)
 – Arbiter logic: continued on the next slide
 – Memory update policy: write-through (Multimax); write-back (Power Challenge)
 – Cache coherency policy: write-update; write-invalidate (Multimax, Power Challenge)
Some real UMA machines
Arbiter logic:
 – Organization: centralized (Multimax); distributed (Power Challenge)
 – Bus allocation policy: fixed priority (Multimax data bus); rotating; round robin (Multimax address bus, Power Challenge); least recently used; first come first served
Structure of the Hector machine - NUMA
[Figure: stations hang off local rings via station controllers; inter-ring interfaces connect the local rings to a global ring.]
(continued on the next slide)
Structure of the Hector machine (cont.)
[Figure: a station is a set of processor modules and an I/O module on a station bus behind the station controller and station bus interface; a processor module holds a processor with cache plus memory, and the I/O adaptor connects display, ethernet and disk devices.]
Structure of the Cray T3D system NUMA
[Figure: the Cray T3D is attached to a Cray Y-MP host and to I/O clusters, which connect workstations, tape drives, disks and networks.]
Design space of CC-NUMA machines
CC-NUMA machines - design dimensions:
 – Complexity of nodes: single processor node; cluster (single bus based, crossbar based)
 – Main memory distribution: per node; per cluster
 – Cache consistency scheme: snoopy cache; snoopy cache + directory; directory
 – Interconnection network: per-column bus; grid of buses; mesh; ring
The Stanford Dash interconnection network
[Figure: clusters 11, 12, 13 and 21, 22, 23 connected by a two-dimensional mesh of interconnection networks.]
Structure of a cluster
[Figure: a cluster contains processors with caches (Pi1/Ci1, ...), memory, an I/O interface, and a directory and intercluster interface on a shared cluster bus.]
Processor level
[Figure: 'Load X' by Pi1 hits in its own cache Ci1.]
Access time: 1 clock
Local cluster level
[Figure: 'Load X' by Pi1 misses in Ci1 and is satisfied by another cache Cij (or the memory) within the same cluster over the cluster bus.]
Access time: 30 clocks
Home cluster level
[Figure: 'Load X' misses in the local cluster C1i; the request travels over the interconnection network to the home cluster C1j, whose memory supplies X.]
Access time: 100 clocks
Remote cluster level
DL: Directory logic. Access time: 135 clocks
[Figure: 'Load X' misses in the local cluster C1i (step 1); the request goes through the directory logic (DLi, DLj) towards the home cluster, and the reply arrives in step 5. Continued on the next slide.]
Remote cluster level (cont.)
[Figure, continued: the home cluster C1m finds X dirty (D) in remote cluster C1k and forwards a Read-Req there; the remote cluster supplies the block with a Read-Rply to the requester and a Sharing-Writeback to the home memory.]
DL: Directory logic. Access time: 135 clocks
Structure of the dash directory
[Figure: the DASH directory hardware - a directory controller (DC) board and a reply controller (RC) board with a pseudo-CPU (PCPU), X- and Y-dimension request and reply routers to neighbouring clusters, a performance monitor, and connections to the cluster address/control and data buses (arbitration masks, bus retry, remote cache status, events).]
Sequence of actions in a store operation requiring remote service
DL: Directory logic
[Figure: Pi1 executes 'Store X' and misses; a Read-Ex-Req is sent towards the home cluster's directory logic, and the read-exclusive reply, invalidation requests (Inv-Req) and acknowledgements (Inv-Ack) complete the operation. Continued on the next slide.]
Sequence of actions in a store operation requiring remote service (cont.)
[Figure, continued: the home cluster C1m finds X shared (S); it sends the Read-Ex reply to the requester and Inv-Req messages to the clusters holding copies (e.g. C1k), which invalidate their copies and acknowledge.]
Convex exemplar architecture
[Figure: a Convex Exemplar hypernode - eight CPUs in pairs, each CPU with a 2 Mb cache and each pair behind an agent with cache/memory control and 512 Mb of memory; the CPU pairs, memories and an I/O subsystem are joined by a 5x5 crossbar (1.25 Gbytes/sec), and hypernodes 1..16 are connected by Scalable Coherent Interface rings (600 Mbyte/sec each).]
Parallel matrix multiply code for a cache-coherent machine

      global c(idim, idim), a(idim, idim)
      global b(idim, idim), nCPUs
      private i, j, k, itid

      call spawn ( nCPUs )
      do j = 1, idim
        if ( mod(j, nCPUs) .eq. itid ) then
          do i = 1, idim
            c(i, j) = 0.0
            do k = 1, idim
              c(i, j) = c(i, j) + a(i, k) * b(k, j)
            enddo
          enddo
        endif
      enddo
      call join
Parallel matrix multiply code for a non-cache-coherent machine

      global c(idim, idim), a(idim, idim)
      global b(idim, idim), nCPUs
      private i, j, k, itid, tmp
      semaphore is(idim, idim)

      call spawn ( nCPUs )
      do j = 1, idim
        if ( mod(j, nCPUs) .eq. itid ) then
          do i = 1, idim
            tmp = 0.0
            do k = 1, idim
              call flush (a(i, k))
              call flush (b(k, j))
              tmp = tmp + a(i, k) * b(k, j)
            enddo
Parallel matrix multiply code for a non-cache-coherent machine (cont.)

            call lock (c(i, j), is(i, j))
            c(i, j) = tmp
            call flush (c(i, j))
            call unlock (c(i, j), is(i, j))
          enddo
        endif
      enddo
      call join
The hierarchical structure of the Kendall Square Research (KSR1) machine - COMA
[Figure: processors with local caches and local cache directories form Ring:0 (ALLCACHE Group:0); a Ring:0 directory connects each Ring:0 to the higher-level Ring:1 (ALLCACHE Group:1); a request travels around the rings from the requester until a responder supplies the data.]
The convergence of scalable MIMD computers
Distributed memory computers (scalable):
 – 1st generation: hypercube (store & forward)
 – 2nd generation: mesh (wormhole routing)
 – 3rd generation: processor + comm. processor + router
Shared memory computers:
 – 1st generation: multistage network, no cache consistency (scalable); shared bus with snoopy cache (small size)
 – 2nd generation: NUMA (no cache consistency)
 – 3rd generation: CC-NUMA, COMA (cluster concept)
Multi-threaded computers (scalable):
 – 4th generation: multi-threaded processor + communication processor + router + cache + directory