Shared Memory MIMD Architectures
Sima, Fountain and Kacsuk, Chapter 18
CSE462
David Abramson, 2004. Material from Sima, Fountain and Kacsuk, Addison Wesley, 1997.
Design choices
Types of shared memory
 – Physically shared memory
 – Virtual (or distributed) shared memory
Scalability issues
 – Organisation of memory
 – Design of interconnection network
 – Cache coherence protocols
Design space of shared memory computers
Shared memory computers - design dimensions:
 – Single address space memory access
   • Physical shared memory: UMA
   • Virtual shared memory: NUMA, CC-NUMA, COMA
 – Interconnection scheme
   • Shared path: single bus based; multiple bus based (bus multiplication, grid of buses, hierarchical system)
   • Switching network: crossbar; multistage network (Omega, Banyan, Benes)
 – Cache coherency
   • Hardware based
   • Software based
Classification of dynamic interconnection networks
Dynamic interconnection networks enable a temporary connection between any two components of a multiprocessor
 – Shared path networks: single bus, multiple buses
 – Switching networks: crossbar, multistage networks
Buses
Very limited scalability
 – Typically 3-5 processors unless special techniques (TDM) are used
 – Can be expanded significantly if we
   • Use private memory
   • Use coherent cache memory
   • Use multiple buses
Structure of a single bus multiprocessor (no caches)
[Figure: processors P1..Pk, memory units M1..Mn and I/O units I/O1..I/Om share a single bus (address, data, control, interrupt and bus exchange lines), managed by bus arbiter and control logic.]
Locking or multiplexing the bus
Two main approaches:
Locking and holding
 – Acquire the bus
 – Send out address and/or data
 – Wait for the data (read), or wait for the write to complete
 – Release the bus
Multiplexing
 – Acquire a bus time slot
 – Send address and/or data
 – Come back for the data n cycles later (read), or keep going if it is a write
Memory write on locked bus
[Timing diagram: each of P1-P4 holds the bus for its full bus cycle plus memory cycle, so the four writes complete serially at times 4, 8, 12 and 16.]
Memory write on multiplexed buses
[Timing diagram: the bus cycle of one processor overlaps with the memory cycle of another, so the four writes by P1-P4 complete by about time 7. Note: this assumes the writes go to different memory banks.]
Memory read on locked bus
[Timing diagram: each read occupies the bus for three phases - phase 1: the address bus is used; phase 2: the bus is idle while the memory cycle completes; phase 3: the data bus is used. Reads by P1-P3 are fully serialized, finishing around times 10, 20 and 30.]
Memory read on multiplexed bus
[Timing diagram: on a multiplexed bus the idle phase 2 of one read overlaps with the address (phase 1) and data (phase 3) phases of the others, so P1-P3 finish by about time 12.]
Memory read on split-transaction bus
[Timing diagram: the address and data phases of different reads are interleaved, so the next transfer is started before the last one has completed.]
Next transfer started before the last one completed!
Needs special associative hardware to match returning data with the request that issued it
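A split-transaction bus has to remember which master issued each outstanding request so the later data phase can be matched up again. The sketch below is purely illustrative (the tag width, table size and field names are assumptions, not taken from any particular bus).

  #include <stdbool.h>
  #include <stdint.h>

  /* One entry per outstanding transaction on the split-transaction bus. */
  typedef struct {
      bool     valid;
      uint32_t addr;     /* address issued in the request phase       */
      uint8_t  master;   /* which bus master is waiting for the reply */
  } pending_t;

  #define MAX_PENDING 8
  static pending_t pending[MAX_PENDING];

  /* Request phase: remember who asked for what and return a tag that
     travels with the reply; -1 means the table is full and the request
     must be retried. */
  int issue_request(uint32_t addr, uint8_t master) {
      for (int tag = 0; tag < MAX_PENDING; tag++) {
          if (!pending[tag].valid) {
              pending[tag] = (pending_t){ true, addr, master };
              return tag;
          }
      }
      return -1;
  }

  /* Reply phase: the associative lookup the slide alludes to - the tag
     identifies the waiting master, and the entry is freed. */
  int complete_request(int tag) {
      int master = pending[tag].master;
      pending[tag].valid = false;
      return master;
  }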
Arbiter Logic
Because the bus is a shared resource, masters must arbitrate for access
The arbiter may be
 – Centralised
   • A central unit looks at all requests
 – Decentralised
   • Logic is split amongst the bus masters
   • Scalable - each new master adds more logic
Design space of arbiter logic
Arbiter logic - design dimensions:
 – Organization: centralized, distributed
 – Bus allocation policy: fixed priority, rotating, round robin, least recently used, first come first served
 – Handling of requests: fixed priority, rotating
 – Handling of grants: fixed priority, rotating
Centralized arbitration with independent requests and grants
[Figure: a central bus arbiter is connected to masters 1..N by separate request (R1..RN) and grant (G1..GN) lines; all masters share the bus lines and a bus-busy line.]
Arbitration sequence:
 1. Masters request the bus on their request lines
 2. The arbiter grants the bus to one of them
 3. The successful master claims the bus (asserts bus busy)
 4. When the transfer is finished, the bus is released
Daisy-chained bus arbitration scheme
[Figure: the central bus arbiter drives a single grant line that is daisy-chained through masters 1..N (Bus Grant 1, G2, ..., GN); all masters share a bus-request line and a bus-busy line.]
Arbitration sequence:
 1. Masters request the bus on the shared request line
 2. The arbiter generates a bus grant
 3. The grant propagates along the chain; a requesting master keeps it and does not propagate it further
 4. That master claims the bus (asserts bus busy)
 5. When the transfer is finished, the bus is released
Decentralized rotating arbiter with independent requests and grants
Problems with the previous designs
 – Lack of fairness
 – Waiting while the grant signal propagates
Rotating priority solves the lack of fairness
 – The logically first master is not the same as the physically first one
[Figure: masters 1..N each have their own arbiter (Arbiter 1..N) with request (R) and grant (G) lines; a priority signal (P1..PN) rotates among the arbiters, which share the bus-busy line. R: Request, G: Grant, P: Priority]
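As a rough illustration of the rotating-priority idea (a software sketch, not the distributed arbiter hardware): the search for a grant starts one position past the last winner, so every master eventually reaches the highest priority.

  #define NMASTERS 8

  /* request[i] is nonzero when master i wants the bus; last_winner is the
     master granted in the previous round. Returns the granted master, or
     -1 if nobody is requesting. */
  int rotating_arbiter(const int request[NMASTERS], int last_winner) {
      for (int offset = 1; offset <= NMASTERS; offset++) {
          int candidate = (last_winner + offset) % NMASTERS;
          if (request[candidate])
              return candidate;   /* highest priority sits just after the last winner */
      }
      return -1;
  }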
Multiple buses
The single bus is the limiting factor
Increase bandwidth by adding additional resources (buses)
1-dimension multiple bus multiprocessor
 – Each processor is connected to all buses
 – Each memory is connected to all buses
 – A processor chooses a bus dynamically
 – Load can be spread across the buses
[Figure: processors P1..Pn and memories M1..Mm all attach to buses B1..Bb.]
2 and 3 dimensional bus system
[Figure: processor-memory (PM) nodes connected by buses in two and three dimensions.]
2 Dimensional bus design
Can support specialised access patterns, e.g. a climate model
 – Access to local data
 – Access to data at the same latitude
 – Access to data at the same longitude
Cluster bus architecture
 – Hierarchy of buses
 – Arbitrarily large networks
 – Cache coherence becomes very difficult
[Figure: clusters 1..8, each an Encore Multimax with uniform interconnection cards and a uniform cluster cache on its cluster bus (Nanobus), joined by a global bus (Nanobus).]
Switching Networks
Multistage networks - design dimensions:
 – No. of stages
 – No. of switches at a stage
 – Topology of links among stages
 – Switch type
   • No. of input and output links
   • Operation mode: normal switch, queuing switch, combining switch
 – Operation mode: blocking, non-blocking
View of a crossbar network
The crossbar allows any processor to connect with any memory
As long as there is no contention for the same memory, the network is non-blocking
[Figure: an n x n grid of switches (S) connects processors P1..Pn to memory units M1..Mn; closing the switch at the crossing point of a processor row and a memory column establishes the connection.]
Detailed structure of a crossbar network
[Figure: each crosspoint of the crossbar contains a switch, an arbiter and a BBCU; the control, address and data buses of processors P1..Pn are switched onto the control, address and data buses of memory unit Mi.]
Multi-stage interconnection networks
Cannot directly connect processors to memory
Use crossbar switches as components to build a larger network
Minimum number of stages is logarithmic
 – Single path
 – No fault tolerance
 – Blocking (if an intermediate switch is in use)
Omega network topology
 – 2 x 2 crossbar switch components
 – Butterfly: built from 8x8 switches
 – Unique path from one port to another (see the routing sketch below)
 – Log depth
[Figure: an 8x8 omega network; input and output ports are numbered 000-111, and the 2x2 switch settings shown include straight through, upper broadcast and lower broadcast.]
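A small sketch of the destination-tag routing implied by the unique-path property (illustrative code, not from the slides; an 8x8 network is assumed): before each stage the lines are perfect-shuffled, then the destination bit for that stage, most significant first, selects the upper (0) or lower (1) switch output.

  #include <stdio.h>

  #define STAGES 3
  #define N (1 << STAGES)

  static unsigned shuffle(unsigned x) {
      /* perfect shuffle = rotate the STAGES address bits left by one */
      return ((x << 1) | (x >> (STAGES - 1))) & (N - 1);
  }

  static void route(unsigned src, unsigned dst) {
      unsigned line = src;
      printf("%u -> %u:", src, dst);
      for (int s = 0; s < STAGES; s++) {
          line = shuffle(line);                      /* link permutation */
          unsigned bit = (dst >> (STAGES - 1 - s)) & 1u;
          line = (line & ~1u) | bit;                 /* switch output    */
          printf("  stage %d: line %u", s, line);
      }
      printf("\n");                                  /* line == dst here */
  }

  int main(void) {
      /* the two requests from the blocking slide further on: both need
         the same middle-stage output line, so one of them is blocked */
      route(0, 5);
      route(6, 4);
      return 0;
  }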
Omega network topology
Some configurations are non-blocking, e.g. the reversal permutation:
0->7, 1->6, 2->5, 3->4, 4->3, 5->2, 6->1, 7->0
[Figure: the omega network routing this permutation with no switch conflicts.]
Broadcast in the omega network
[Figure: a broadcast from one input port reaching all eight output ports 000-111 using the broadcast switch settings.]
Blocking in an omega network
[Figure: routing 0->5 and 6->4 at the same time; the two paths need the same switch output in the middle stage, so one request is blocked.]
(0->5, . . ., 6->4, . . .)
Multistage Network Properties

Network Type      | # Stages        | Switches/stage | Topology                      | Switch Size | Op Mode
Omega             | log2N           | N/2            | 2-way shuffle                 | 2x2         | Blocking
Butterfly         | log8N           | N/8            | 8-way shuffle                 | 8x8         | Blocking
Generalized-cube  | S = log2N       | N/2            | [0,1] shuffle, [1,S] exchange | 2x2         | Blocking
Benes             | S = 2log2N - 1  | N/2            | [2,S] exchange                | 2x2         | Non-blocking
Hot-spot saturation in a blocking omega network
[Figure: an 8x8 blocking omega network connecting P0-P7 to M0-M7.]
P2->M4 active => P7->M4 blocked => P1->M5 blocked => P5->M7 blocked
Hotspots in Omega networks
In a shared memory machine there are two sorts of contention
 – Memory unit
 – Switch elements
Certain access patterns can repeatedly block each other even though they address different memory units
Message combining can solve these problems
 – The switch element buffers the request
 – Memory only sees one request
[Figure: three 'Read 100' requests are combined in the network so that memory sees a single read of location 100.]
Structure of a combining switch
Introduced on NYU Ultracomputer
[Figure: a 2x2 combining switch between Proc(i)/Proc(j) and Mem(k)/Mem(l); each output has a combining queue and a non-combining queue, and wait buffers hold the information needed to split combined replies on the way back.]
Cache Coherence
Cache coherence problems are caused by
 – Sharing of writable data
 – Process migration
 – I/O activity
[Figures: three processor/cache/memory examples - one processor writes 100,5 while others read location 100 or write 100,100, leaving cached copies inconsistent; a migrating process finds a stale copy in the cache of the processor it moves to; an I/O device reads location 100 straight from memory while a newer value is still held in a cache.]
Classification of data structures
Read only
 – Never causes cache coherence problems
Shared writable
 – Main source of cache coherence problems
Private writable data
 – Causes problems with process migration
Solutions
 – Hardware-based protocols
 – Software-based protocols
Design space of hardware-based cache coherence protocols
Hardware-based cache coherence protocols - design dimensions:
 – Memory update policy: write-through, write-back
 – Cache coherence policy: write-invalidate, write-update
 – Interconnection scheme (continued on the next slide)
Design space of hardware-based cache coherence protocols (cont.)
Interconnection scheme:
 – Single bus: snoopy cache protocols
 – Multistage network: directory schemes (centralized or distributed; full-map, limited or chained directories)
 – Multiple bus: hierarchical cache coherence protocols
Write-through memory update policy
 – Memory is always updated on a write
 – Intuitively easier to keep caches coherent
[Figure: Pi executes 'Store D1'; with write-through the new value D1 is written to Pi's cache and immediately to main memory.]
Write-back memory update policy
 – Data is only written back to memory when the block is flushed
 – The processor can do many writes before the block is flushed
[Figure: Pi executes 'Store D1'; the new value D1 stays in Pi's cache while main memory still holds the old value D until the block is flushed.]
Write-update cache coherence policy
When a processor writes a variable, it updates all copies held in the other processors' caches
[Figure: Pk executes 'Store D1' and broadcasts Update(D1); the copies in Pi's and Pj's caches are updated to D1.]
Write-invalidate cache coherence policy
When a processor writes a variable, it invalidates the copies in all other caches
Makes the writing processor the "owner"
[Figure: Pk executes 'Store D1' and broadcasts Invalidate(addr(D)); the copies in the other caches become invalid.]
Snoopy Protocols
If the interconnection network supports broadcasting (cheaply), a snoopy policy is effective
 – Every cache "watches" every transaction to memory
 – Works for buses
If broadcast is not efficient
 – Use a directory-based scheme
 – The directory keeps track of where cache blocks are located
Snoopy write update protocol
Possible cache block states
 – Used to support the cache coherence protocol
 – Valid-exclusive
   • The only copy of this cache block; cache and memory are consistent
 – Shared
   • Several copies of this cache block exist
 – Dirty
   • The only copy, but cache and memory are inconsistent
Read Miss logic
The snoopy cache controller broadcasts a Read-Blk command on the bus
 – If there are shared copies
   • The block is delivered by a cache holding a copy
 – If there is a dirty copy
   • It is supplied and flushed to main memory; all copies become shared
 – If a valid-exclusive copy exists
   • The copy is supplied and all copies become shared
 – If there is no cached copy
   • Memory supplies the data; the block becomes valid-exclusive
Snoopy Update - Read miss
[Figure: Pi misses on 'Load D' and broadcasts Read-blk(addr(D)); the block is supplied and the cache states (valid-exclusive, shared, dirty) change as described by the read-miss logic above.]
Write hit logic
If the block is valid-exclusive or dirty
 – The write is performed locally
 – The new state is dirty
If the block is shared
 – Broadcast an update-block command on the bus
 – All copies (including memory) are updated
 – The state remains shared
Snoopy Update – Write hit Exclusive
[Figure: Pi holds D valid-exclusive and executes 'Write D'; the write completes locally with no bus traffic and the block becomes dirty.]
Snoopy Update – Write hit Shared
[Figure: Pi holds D shared and executes 'Write D'; a Write(addr(D)) update is broadcast on the bus, all cached copies and memory are updated, and the states remain shared.]
Write miss
If only memory contains a copy
 – Memory is updated
 – The requesting cache is loaded with the data - valid-exclusive
If shared copies are available
 – All copies (including the memory copy) are updated
 – The requesting cache is loaded with the data - shared
If a dirty or valid-exclusive copy exists
 – The other cached copy is updated
 – Memory is updated
 – The requesting cache is loaded with the data - shared
Snoopy Update – Write miss
[Figure: Pi misses on 'Write D' and broadcasts Write(addr(D)); the copies and memory are updated according to the write-miss logic above.]
State transition graph for snoopy update
The cache responds to
 – P-Read and P-Write commands from the processor, and
 – Read-Blk, Write-Blk and Update-Blk commands from the bus
[Figure: state transition graph over the states valid-exclusive, shared and dirty, with edges labelled by the processor and bus commands above.]
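Read as a transition function per event, the graph can be sketched in a few lines. The fragment below encodes two cases from the preceding slides for the write-update protocol (a processor write hit and a snooped Read-Blk); it is one plausible encoding, not the literal controller logic.

  #include <stdbool.h>

  typedef enum { INVALID, VALID_EXCLUSIVE, SHARED, DIRTY } blk_state;

  /* P-Write hit: valid-exclusive or dirty blocks are written locally and
     become dirty; shared blocks broadcast an Update-Blk and stay shared. */
  blk_state on_p_write_hit(blk_state s, bool *broadcast_update) {
      switch (s) {
      case VALID_EXCLUSIVE:
      case DIRTY:
          *broadcast_update = false;
          return DIRTY;
      case SHARED:
          *broadcast_update = true;      /* memory is updated as well */
          return SHARED;
      default:
          return s;                      /* misses handled elsewhere  */
      }
  }

  /* Snooped Read-Blk from another cache: a dirty copy is flushed to
     memory, and any copy we hold becomes shared. */
  blk_state on_bus_read_blk(blk_state s, bool *flush_to_memory) {
      *flush_to_memory = (s == DIRTY);
      return (s == INVALID) ? INVALID : SHARED;
  }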
Structure of the snoopy cache controller
Snoopy controller needs to operate at bus speed
[Figure: each processing element PEi..PEn contains the processor, its cache, and a snoopy cache controller made of two parts - a cache controller on the processor side and a snoopy controller on the bus side - which share the cache directory; the snoopy controller watches the address (A) and data (D) lines of the memory bus.]
Directory Schemes
Directory schemes send consistency commands only to those caches that hold a valid copy of the shared block
Designed for systems where snooping is not possible
Three main approaches (a directory-entry sketch follows the list):
 – Full-map directory
   • Each entry points to all caches
   • The entry indicates whether the block is present in each remote cache
   • Not efficient for large systems
 – Limited directory
   • Points to only a subset of the caches
   • Works because a variable tends not to be shared by all processors
   • Otherwise the same information as in the full map
 – Chained directory
   • Directory entries form a linked list
   • Scalable - processors can be added without increasing the directory width
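The practical difference between the full-map and limited schemes is the shape of one directory entry, sketched below (sizes and field names are assumptions made for illustration, not the slides' hardware).

  #include <stdint.h>

  #define NCACHES   64   /* assumed machine size                     */
  #define NPOINTERS  4   /* assumed pointer count in a limited entry */

  /* Full map: one presence bit per cache, so the entry width grows
     with the number of caches in the machine. */
  typedef struct {
      uint64_t present;      /* bit i set => cache i holds a copy    */
      uint8_t  dirty;
  } fullmap_entry;

  /* Limited: a fixed, small number of cache-ID pointers, independent
     of the machine size. */
  typedef struct {
      uint8_t owner[NPOINTERS];
      uint8_t count;
      uint8_t dirty;
  } limited_entry;

  /* Consistency commands go only to the recorded sharers. */
  void invalidate_sharers(const fullmap_entry *e, void (*send_inv)(int)) {
      for (int i = 0; i < NCACHES; i++)
          if (e->present & (1ULL << i))
              send_inv(i);
  }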
Chained directory scheme
[Figures: the directory entry for block X in shared memory points to the cache at the head of a chain; each cache entry holds X plus a pointer to the next cache, with the last entry marked CT (chain terminator). When another processor reads X, its cache is linked in and the directory entry is updated to point to the new head.]
Scalable Coherence Interface
A concrete example of a chained directory
An IEEE standard that defines
 – The interface to the interconnection network
 – Not any particular interconnection network
The interface is point-to-point
 – Well suited to networks like the Convex Exemplar's simple, unidirectional ring
Designed for building scalable shared memory machines
Structure of sharing-lists in the SCI
Operations are defined for
 – Creation
 – Insertion
 – Deletion
 – Reduction to a single node
[Figure: memory holds the block data (64 bits) plus a memory state (mstate) and a forward pointer (forw_id) to the head of the sharing list; each node's cache entry holds the data plus a cache state (cstate), a memory ID (mem_id), and forward (forw_id) and backward (back_id) pointers, forming a doubly linked list over nodes i, j, k.]
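The sharing list is essentially a doubly linked list threaded through the caches, with the memory directory holding the head pointer. A minimal sketch of the entry layout and of the head insertion shown on the next slide (illustrative only; the real SCI field encodings differ):

  #include <stdint.h>

  #define NNODES 64
  #define NIL    0xFFFF

  typedef struct {              /* per-block state at the memory/home */
      uint16_t forw_id;         /* node ID of the list head, or NIL   */
      uint8_t  mstate;
  } mem_entry_t;

  typedef struct {              /* per-block state in each cache      */
      uint16_t forw_id;         /* next node towards the tail, or NIL */
      uint16_t back_id;         /* previous node; NIL marks the head  */
      uint8_t  cstate;
  } cache_entry_t;

  static cache_entry_t cache_entry[NNODES];

  /* A new reader prepends itself: it becomes the head of the sharing
     list and the old head (if any) becomes its forward neighbour. */
  void sci_prepend(mem_entry_t *m, uint16_t me) {
      cache_entry[me].forw_id = m->forw_id;
      cache_entry[me].back_id = NIL;
      if (m->forw_id != NIL)
          cache_entry[m->forw_id].back_id = me;
      m->forw_id = me;          /* memory now points at the new head  */
  }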
Insertion in a sharing-list
[Figure: the new node sends a 'prepend' request to memory and receives a 'new-head' response; it then links itself in front of the previous head of the sharing list.]
Messages for deletion
[Figure: node j deletes itself by sending two 'update forward' messages (1, 2) so that the pointers of its neighbours bypass it.]
Structure of the sharing-list after deletion
[Figure: after the deletion the sharing list links memory, node i and node k only.]
Hierarchical cache coherence
[Figure: a two-level hierarchy - processors P0, P1 with first-level caches C10, C11 on local buses B10, B11, B12, and second-level caches C20, C21, C22 on a global bus B20 below main memory; a write to X in one cluster propagates upwards and invalidations are sent down to the other clusters that hold X.]
Software Based Coherence
Software approaches rely on compiler assistance
Identify different classes of variables:
 – Read-only
 – Read-only for any number of processors and read-write for one process
 – Read-write for one process
 – Read-write for any number of processes
Once identified (by static analysis), each class is handled differently
Software based cache coherence
Read-only variables
 – Can be cached at any time
Read-only for any number and read-write for one process
 – Can only be cached on the writing processor
Read-write for one process
 – Cached only on that processor
Read-write for many processes
 – Cannot be cached at all
Clearly, accurate information is needed in order to limit the performance hit
Classification of software-based cache coherence protocols
Software-based protocols:
 – Indiscriminate invalidation
 – Selective invalidation
   • Parallel for-loop based
   • Critical section based
   • Fast selective invalidation
   • Version control scheme
   • Timestamp scheme
Invalidation
Can invalidate the entire cache
 – A single hardware mechanism for clearing the valid bits
 – Very conservative!
Selective invalidation
 – Invalidate before critical sections
 – Understand parallel for-loops and invalidate accordingly
 – Still needs hardware support to clear entries effectively
[Figure: a cache of key/data entries, each with a valid (v) bit.]
Using knowledge of critical regions
  :
  :
  Secure_lock()
  Invalidate_cache()
  :
  :        <- variables in here can be used without worrying about any other processes
  Flush_cache()
  Release_lock()
  :
  :
Using knowledge of parallel loops
Original program:
  :
  Par For (i = 0; i < 100; i++) {
    :
  }
  :
Knowledge about the loop lets it be split between processors:
Processor 0:
  :
  Par For (i = 0; i < 50; i++) {
    :
  }
Processor 1:
  :
  Par For (i = 50; i < 100; i++) {
    :
  }
Selective invalidation schemes
Add a change bit to the cache block status
 • The change bit is set to true when the block changes
 – If the block is then read, it is invalidated and reloaded
Add a timestamp to the cache block
 – A clock is associated with each data structure
 – The timestamp in the cache is updated when the block changes
 – The timestamp in the block can be compared with the current timestamp
Adding a version number
 – Similar to the clock scheme (a sketch of the timestamp idea follows)
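A sketch of the timestamp idea in software terms (the names and the exact comparison are invented for illustration; the published schemes differ in detail):

  #include <stdbool.h>
  #include <stdint.h>

  typedef struct {
      uint32_t filled_at;    /* timestamp recorded when the block was cached */
      bool     valid;
  } cached_block;

  typedef struct {
      uint32_t clock;        /* clock associated with the data structure     */
  } shared_structure;

  /* A cached copy is trusted only if it was filled at or after the
     structure's current clock value. */
  bool usable(const cached_block *b, const shared_structure *s) {
      return b->valid && b->filled_at >= s->clock;
  }

  /* Advance the clock after a phase (e.g. a parallel loop) in which other
     processors may have written the structure; stale copies then fail the
     comparison and are reloaded instead of being explicitly invalidated. */
  void end_of_phase(shared_structure *s) {
      s->clock++;
  }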
Synchronization & Event Ordering
Mutual exclusion is required in many parallel algorithms
 – Monitors
 – Semaphores
All high-level schemes are based on low-level synchronization tools
Atomic test-and-set is common in shared memory multiprocessors
 – Needs to take account of the cache
   • Minimum traffic generated while waiting
   • Low latency release of a waiting processor
   • Low latency acquisition of a free lock
 – Typically works well on small bus-based machines
Synchronization with test-and-set
The lock variable has two states
 – OPEN
 – CLOSED
Acquire lock:
  char *lock;
  while (exchange(lock, CLOSED) == CLOSED);
Release lock:
  *lock = OPEN;
Cache states after Pi successfully executed test&set on lock
[Figure: after Pi's exchange on the lock, the lock block is dirty in Ci and the copies in the other caches are invalid.]
Bus commands when Pj executes test&set on lock and cache states after
[Figure: Pj executes test&set - (1) Read-Blk(lock) on the bus, (2) the block is supplied to Cj, (3) Invalidate(lock) is broadcast; the lock (closed) is now dirty in Cj and invalid in Ci.]
Cache states after Pk executed test&set on lock
[Figure: after Pk's test&set the lock (closed) is dirty in Ck and the copies in the other caches are invalid.]
Busy waiting with cache coherence
An indivisible test-and-set instruction requires write access to the lock
 – This causes the processor doing the test-and-set to acquire the variable in its cache, invalidating all other copies
When multiple processors spin on the lock
 – Each one tries to acquire the variable in its cache
 – This causes cache thrashing
Instead, use a snooping lock
 – Spin on a plain test without the indivisible test-and-set
 – Only attempt the exchange once the lock is seen OPEN
Efficient algorithm for locking
  while (exchange(lock, CLOSED) == CLOSED)
      while (*lock == CLOSED);
The first while claims and closes the lock if it is OPEN
If it is already CLOSED, control transfers to the second loop, which
 – Continuously reads the lock (from the local cache)
 – Generates no bus traffic during this phase
 – When the lock becomes OPEN, tries the test-and-set again
Test and test-and-set
It is even more efficient to test the lock before trying to set it:
  for (;;) {
      while (*lock == CLOSED);
      if (exchange(lock, CLOSED) != CLOSED)
          break;
  }
Introduces extra latency for unused locks
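For reference, the same test-and-test-and-set loop written as runnable C11, with atomic_exchange standing in for the exchange() primitive used on the slides (a sketch, not the original code):

  #include <stdatomic.h>
  #include <stdbool.h>

  #define OPEN   false
  #define CLOSED true

  typedef atomic_bool spinlock_t;

  void lock_acquire(spinlock_t *lock) {
      for (;;) {
          /* spin in the local cache; no bus traffic while CLOSED */
          while (atomic_load_explicit(lock, memory_order_relaxed) == CLOSED)
              ;
          /* lock looks OPEN: try the atomic exchange */
          if (atomic_exchange_explicit(lock, CLOSED,
                                       memory_order_acquire) != CLOSED)
              return;   /* we closed it ourselves - lock held */
      }
  }

  void lock_release(spinlock_t *lock) {
      atomic_store_explicit(lock, OPEN, memory_order_release);
  }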
Lock implementation on scalable multiprocessors
The NYU Ultracomputer and IBM RP3 implemented fetch-and-add
Fetch-and-add
 – An atomic operation
 – All memory modules are augmented with an adder circuit

  int fetch_and_add(int *x, int a)
  {
      int temp = *x;    /* performed atomically at the memory module */
      *x = *x + a;
      return temp;
  }
Example of fetch-and-add
Suppose we want to implement the parallel loop
  DOALL N = 1 to 1000
    <loop body using N>
  ENDDO
and we want to allocate iterations to processors dynamically:
  N = 1;
  i = fetch_and_add(&N, 1);
  while (i <= 1000) {
      loop_body(i);
      i = fetch_and_add(&N, 1);
  }
Regardless of how many processors execute the loop
 – Each processor will get a different value of i
Fetch-and-add
Fetch-and-add automatically allocates loop indexes in this example
But location N becomes a hotspot
The combining network described before will not work correctly without modification
 – A plain read-combining switch would return the same value to every requester
Change each switch element so that it can itself implement the fetch-and-add operation
 – Distributed operation without hotspots (see the sketch below)
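The essential trick in a combining switch can be sketched in a few lines (illustrative only; the field names and the two-input restriction are assumptions): on the way in, two fetch-and-adds to the same location merge into one, and the first increment is parked in the wait buffer so the two replies can be reconstructed on the way out, as in the forward/back propagation figures that follow.

  typedef struct { int addr; int increment; } faa_req_t;

  /* Forward path: two requests for the same address leave the switch as
     one request; the wait buffer remembers the first increment so the
     reply can be decombined later. */
  faa_req_t combine(faa_req_t a, faa_req_t b, int *wait_buffer) {
      faa_req_t out = { a.addr, a.increment + b.increment };
      *wait_buffer = a.increment;     /* needed to reconstruct b's reply */
      return out;
  }

  /* Backward path: memory returned 'old' (the value before the combined
     add); request a gets 'old', request b gets 'old + a.increment'. */
  void decombine(int old, int wait_buffer, int *reply_a, int *reply_b) {
      *reply_a = old;
      *reply_b = old + wait_buffer;
  }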
Forward propagation of fetch-and-add
[Figure: eight F&A(N,1) requests are combined pairwise into F&A(N,2), then F&A(N,4), and reach memory module M6 as a single F&A(N,8); M6 returns the old value N=1 and N becomes 9.]
Back propagation of fetch-and-add
[Figure: on the way back each switch uses its wait buffer to decombine the reply, so the eight processors receive the distinct values 1 to 8.]
A quick tour of some UMA machines
Single bus multiprocessors - design dimensions with example machines:
 – Bus working mode: locked bus; pending bus (Multimax); split-transaction bus (Power Challenge)
 – Arbiter logic: continued on the next slide
 – Memory update policy: write-through (Multimax); write-back (Power Challenge)
 – Cache coherency policy: write-update; write-invalidate (Multimax, Power Challenge)
Some real UMA machines
Arbiter logic:
 – Organization: centralized (Multimax); distributed (Power Challenge)
 – Bus allocation policy: fixed priority (Multimax data bus); rotating; round robin (Multimax address bus, Power Challenge); least recently used; first come first served
Structure of the Hector machine - NUMA
[Figure: stations hang off local rings via station controllers; inter-ring interfaces connect the local rings to a global ring.]
(continued on the next slide)
Structure of the Hector machine (cont.)
[Figure: a station is a set of processor modules and an I/O module on a station bus behind the station controller and station bus interface; a processor module holds a processor with cache plus memory, and the I/O adaptor connects display, ethernet and disk devices.]
Structure of the Cray T3D system NUMA
[Figure: the Cray T3D is attached to a Cray Y-MP host and to I/O clusters, which connect workstations, tape drives, disks and networks.]
Design space of CC-NUMA machines
CC-NUMA machines - design dimensions:
 – Complexity of nodes: single processor node; cluster (single bus based, crossbar based)
 – Main memory distribution: per node; per cluster
 – Cache consistency scheme: snoopy cache; snoopy cache + directory; directory
 – Interconnection network: per-column bus; grid of buses; mesh; ring
The Stanford Dash interconnection network
[Figure: clusters 11, 12, 13 and 21, 22, 23 connected by a two-dimensional mesh of interconnection networks.]
Structure of a cluster
[Figure: a cluster contains processors with caches (Pi1/Ci1, ...), memory, an I/O interface, and a directory and intercluster interface on a shared cluster bus.]
Processor level
[Figure: 'Load X' by Pi1 hits in its own cache Ci1.]
Access time: 1 clock
Local cluster level
[Figure: 'Load X' by Pi1 misses in Ci1 and is satisfied by another cache Cij (or the memory) within the same cluster over the cluster bus.]
Access time: 30 clocks
Home cluster level
[Figure: 'Load X' misses in the local cluster C1i; the request travels over the interconnection network to the home cluster C1j, whose memory supplies X.]
Access time: 100 clocks
Remote cluster level
DL: Directory logic. Access time: 135 clocks
[Figure: 'Load X' misses in the local cluster C1i (step 1); the request goes through the directory logic (DLi, DLj) towards the home cluster, and the reply arrives in step 5. Continued on the next slide.]
Remote cluster level (cont.)
[Figure, continued: the home cluster C1m finds X dirty (D) in remote cluster C1k and forwards a Read-Req there; the remote cluster supplies the block with a Read-Rply to the requester and a Sharing-Writeback to the home memory.]
DL: Directory logic. Access time: 135 clocks
Structure of the dash directory
[Figure: the DASH directory hardware - a directory controller (DC) board and a reply controller (RC) board with a pseudo-CPU (PCPU), X- and Y-dimension request and reply routers to neighbouring clusters, a performance monitor, and connections to the cluster address/control and data buses (arbitration masks, bus retry, remote cache status, events).]
Sequence of actions in a store operation requiring remote service
DL: Directory logic
[Figure: Pi1 executes 'Store X' and misses; a Read-Ex-Req is sent towards the home cluster's directory logic, and the read-exclusive reply, invalidation requests (Inv-Req) and acknowledgements (Inv-Ack) complete the operation. Continued on the next slide.]
Sequence of actions in a store operation requiring remote service (cont.)
[Figure, continued: the home cluster C1m finds X shared (S); it sends the Read-Ex reply to the requester and Inv-Req messages to the clusters holding copies (e.g. C1k), which invalidate their copies and acknowledge.]
Convex exemplar architecture
[Figure: a Convex Exemplar hypernode - eight CPUs in pairs, each CPU with a 2 Mb cache and each pair behind an agent with cache/memory control and 512 Mb of memory; the CPU pairs, memories and an I/O subsystem are joined by a 5x5 crossbar (1.25 Gbytes/sec), and hypernodes 1..16 are connected by Scalable Coherent Interface rings (600 Mbyte/sec each).]
Parallel matrix multiply code for a cache-coherent machine

      global c(idim, idim), a(idim, idim)
      global b(idim, idim), nCPUs
      private i, j, k, itid

      call spawn ( nCPUs )
      do j = 1, idim
        if ( mod(j, nCPUs) .eq. itid ) then
          do i = 1, idim
            c(i, j) = 0.0
            do k = 1, idim
              c(i, j) = c(i, j) + a(i, k) * b(k, j)
            enddo
          enddo
        endif
      enddo
      call join
Parallel matrix multiply code for a non-cache-coherent machine

      global c(idim, idim), a(idim, idim)
      global b(idim, idim), nCPUs
      private i, j, k, itid, tmp
      semaphore is(idim, idim)

      call spawn ( nCPUs )
      do j = 1, idim
        if ( mod(j, nCPUs) .eq. itid ) then
          do i = 1, idim
            tmp = 0.0
            do k = 1, idim
              call flush (a(i, k))
              call flush (b(k, j))
              tmp = tmp + a(i, k) * b(k, j)
            enddo
Parallel matrix multiply code for a non-cache-coherent machine (cont.)

            call lock (c(i, j), is(i, j))
            c(i, j) = tmp
            call flush (c(i, j))
            call unlock (c(i, j), is(i, j))
          enddo
        endif
      enddo
      call join
The hierarchical structure of the Kendall Square Research (KSR1) machine - COMA
[Figure: processors with local caches and local cache directories form Ring:0 (ALLCACHE Group:0); a Ring:0 directory connects each Ring:0 to the higher-level Ring:1 (ALLCACHE Group:1); a request travels around the rings from the requester until a responder supplies the data.]
The convergence of scalable MIMD computers
Distributed memory computers (scalable):
 – 1st generation: hypercube (store & forward)
 – 2nd generation: mesh (wormhole routing)
 – 3rd generation: processor + comm. processor + router
Shared memory computers:
 – 1st generation: multistage network, no cache consistency (scalable); shared bus with snoopy cache (small size)
 – 2nd generation: NUMA (no cache consistency)
 – 3rd generation: CC-NUMA, COMA (cluster concept)
Multi-threaded computers (scalable):
 – 4th generation: multi-threaded processor + communication processor + router + cache + directory