04/18/23 slide 1PCOD: MIMD II Lecture (Coherence)
Per Stenström (c) 2008, Sally A. McKee (c) 2011
Snoop-based multiprocessor design
Correctness issues: the semantic model (coherence and memory consistency); deadlock, livelock, and starvation
Design issues, from simplistic to realistic, one by one: single-level cache and an atomic bus; multi-level cache design issues; split-transaction bus design issues
Scalable snoop-based design techniques
More Architectural Support for MIMD
Key goals
Correctness; design simplicity (verification is costly); high performance
Design simplicity and performance are often at odds
To do: get a picture of the bus-based coherence organization, showing dual tags and the processor-side and bus-side controllers
Correctness Requirements
Semantic model: the contract between hardware and software. Cache coherence requires write serialization; sequential consistency requires program order and write atomicity.
Deadlock: no forward progress and no system activity; resources are held in a cyclic relationship.
Livelock: no forward progress despite system activity; resources are allocated and de-allocated with no progress.
Starvation: some processes are denied service; often temporary.
Single-Level Cache and Atomic Bus
Single-level caches and an atomic bus: tag and cache controller design issues; snoop protocol design; race conditions (non-atomic state transitions)
Correctness issues: serialization; deadlock, livelock, and starvation
Atomic (synchronization) operations
Cache Controller Design
Extension for snoop support: bus requests also access the cache, so there is a processor-side controller and a bus-side controller
Recall actions on a cache access:
1. Indexing cache with tag check
2. Get/request data
3. Update state bits
[Figure: cached data and two copies of the tags, accessed by processor requests and bus requests]
Performance issue: simultaneous tag accesses from processor and bus
Solution: duplicate the tags, but keep the two copies consistent
Reporting Snoop Results
Where to read the block from (memory or a cache), and which state transition to make? Support this with wired-AND/wired-OR bus lines.
When is the snoop result available? The main alternatives:
Synchronous: requires dual tags and must adapt to the worst case, because updates of the state bits caused by the processor can delay the snoop.
Asynchronous (variable-delay snoop): assume the minimum delay, but add enough cycles when necessary.
Memory state bit: a bit per memory block that distinguishes a valid memory block from an invalid one.
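The wired-OR reporting above can be sketched in a few lines. This is a minimal model, not a description of any particular bus: the line names `shared` and `dirty` and the function `combine_snoop` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SnoopReply:
    shared: bool  # this cache holds a valid copy of the block
    dirty: bool   # this cache holds the block in modified state


def combine_snoop(replies):
    """Model two wired-OR bus lines: each line is the OR of all cache replies."""
    shared_line = any(r.shared for r in replies)
    dirty_line = any(r.dirty for r in replies)
    # If any cache holds the block dirty, that cache must supply the data;
    # otherwise memory holds a valid copy and can respond.
    source = "cache" if dirty_line else "memory"
    return shared_line, dirty_line, source
```

The requester learns both where the data comes from (`source`) and what it needs for its own state transition (whether other copies exist) from the same two lines.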
Dealing with Write-backs
One would like to service the miss before writing back the replaced block.
Two implications: add a write-back buffer; bus snoops must also look into the write-back buffer.
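A small sketch of why the snoop must cover the write-back buffer: once a dirty block has been evicted but not yet written to memory, the buffer holds the only up-to-date copy. The class and method names are hypothetical, and states are reduced to "M" and "S".

```python
class CacheWithWriteBackBuffer:
    def __init__(self):
        self.lines = {}      # block address -> (state, data); state is "M" or "S"
        self.wb_buffer = {}  # block address -> dirty data awaiting write-back

    def evict(self, addr):
        """Remove a replaced block; dirty data is parked in the write-back buffer."""
        state, data = self.lines.pop(addr)
        if state == "M":
            self.wb_buffer[addr] = data

    def snoop_read(self, addr):
        """Return the data if this cache must supply it, else None."""
        if addr in self.lines and self.lines[addr][0] == "M":
            return self.lines[addr][1]
        # The newest copy may be sitting in the write-back buffer.
        return self.wb_buffer.get(addr)

    def drain(self, memory):
        """Write buffered blocks to memory, e.g. after the miss has been serviced."""
        memory.update(self.wb_buffer)
        self.wb_buffer.clear()
```

Between `evict` and `drain`, a snoop that checked only `lines` would miss the block and memory would return stale data.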
Baseline Architecture
[Figure: baseline architecture, including the write-back buffer]
State Transitions Must Appear Atomic
Assume a block is in shared state in both caches, and both Cache 1 and Cache 2 want to issue an Upgrade (Upgr).
1. Cache 1 awaits use of the bus
2. Cache 2 gets access to the bus
3. The Upgrade from Cache 2 changes the state of the block in Cache 1 to invalid
4. The Upgrade from Cache 1 is performed. However, an Upgrade is no longer appropriate, because Cache 1 no longer holds a valid copy
Non-Atomic State Transitions
There is a time window between issuing and performing a bus operation.
Problem: another transaction may change the action that is needed.
Solution: extend the protocol with non-atomic states.
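One way to picture the extra states is the sketch below, which adds two transient states to MSI. The state names `S_TO_M` and `I_TO_M` and the class `Line` are assumptions for illustration; data transfer and flushes are left out.

```python
class Line:
    """One cache line under MSI plus two transient (non-atomic) states."""

    def __init__(self, state="S"):
        self.state = state
        self.pending = None  # bus request waiting for the bus, if any

    def processor_write(self):
        if self.state == "S":
            self.state, self.pending = "S_TO_M", "Upgr"
        elif self.state == "I":
            self.state, self.pending = "I_TO_M", "ReadX"

    def snoop_invalidation(self):
        """Another cache's Upgr or ReadX for this block appears on the bus."""
        if self.state == "S_TO_M":
            # Our copy is gone, so the queued Upgr must become a ReadX.
            self.state, self.pending = "I_TO_M", "ReadX"
        elif self.state in ("S", "M"):
            self.state = "I"

    def bus_granted(self):
        """Our request is on the bus: the transition to M completes."""
        issued, self.pending = self.pending, None
        self.state = "M"
        return issued
```

If two lines start in S and both write, the one that loses arbitration sees the other's Upgr while in `S_TO_M` and ends up issuing a ReadX instead of a stale Upgr.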
Correctness Issues
Write serialization: ownership acquisition and cache block modification should appear atomic. The processor may not write data into the cache until the read-exclusive request is on the bus, i.e., until it is committed.
Deadlock: two cache controllers may end up in a circular dependence if one is locking its cache while waiting for the bus (fetch deadlock).
Livelock: can occur if several controllers issue read-exclusive requests for the same block at the same time. Remedy: let each one complete before taking care of the next.
Starvation: bus arbitration is unfair to some nodes.
A Fetch-Deadlock Situation
[Figure: Cache 1 issues ReadX B while Cache 2 issues BusRd A]
1. Cache 1 awaits use of the bus, but Cache 1 is locked
2. Cache 2 gets access to the bus
3. Cache 2 waits for Cache 1 to respond, and Cache 1 waits for Cache 2 to release the bus. Deadlock!
A Livelock Situation
A read-exclusive operation involves (1) acquisition of an exclusive block and (2) reattempting the write in the local cache.
[Figure: Cache 1 and Cache 2 both issue ReadX A]
1. Try to get the bus
2. Make Cache 1’s copy invalid
3. Make Cache 2’s copy invalid
Etc. ... Livelock!
Remedies to Correctness Issues
Do not update the cache until the Upgrade is on the bus.
Service incoming snoops while waiting for the bus.
Complete the transaction with no interruption.
[Figure: Cache 1 and Cache 2 both issue Upgr]
Implementation of Atomic Memory Operations
Test&set should result in an atomic read-modify-write.
Cacheable t&s vs. memory-based implementation: cacheable gives lower latency and bandwidth use for spinning and self-acquisition, but a longer time to transfer the lock to another node; memory-based requires the bus to be locked down.
Load-linked (LL) and store-conditional (SC) implementation: a lock flag and a lock address register at each processor.
LL reads the block, sets the lock flag, and puts the block address in the register.
Incoming invalidates are checked against the address: if they match, the flag is reset.
SC checks the lock flag as an indicator of an intervening conflicting write: if reset, fail; if not, succeed.
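The LL/SC mechanism described above can be sketched as follows. This is a simplified model under stated assumptions: the `Bus` class, the 64-byte `BLOCK_SIZE`, and the extra address comparison inside `store_conditional` are illustrative choices, and caches are omitted so that every write is broadcast as an invalidate.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed)


class Bus:
    def __init__(self):
        self.memory = {}
        self.processors = []

    def write(self, writer, addr, value):
        # Broadcast an invalidate to every other processor, then perform the write.
        for p in self.processors:
            if p is not writer:
                p.snoop_invalidate(addr // BLOCK_SIZE)
        self.memory[addr] = value


class Processor:
    def __init__(self, bus):
        self.lock_flag = False
        self.lock_addr = None  # lock address register (holds a block address)
        self.bus = bus
        bus.processors.append(self)

    def load_linked(self, addr):
        self.lock_flag = True
        self.lock_addr = addr // BLOCK_SIZE
        return self.bus.memory.get(addr, 0)

    def snoop_invalidate(self, block):
        # An incoming invalidate that matches the lock address resets the flag.
        if self.lock_flag and block == self.lock_addr:
            self.lock_flag = False

    def store_conditional(self, addr, value):
        # Assumption: SC to a block other than the linked one also fails.
        if not self.lock_flag or self.lock_addr != addr // BLOCK_SIZE:
            return False
        self.bus.write(self, addr, value)
        self.lock_flag = False
        return True


def try_lock(proc, addr):
    """One attempt at acquiring a lock built from LL/SC; 0 means free."""
    if proc.load_linked(addr) != 0:
        return False
    return proc.store_conditional(addr, 1)
```

If two processors both execute LL on the same lock word, the first SC succeeds and its invalidate resets the other processor's flag, so the second SC fails without writing anything.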
Multi-Level Cache Designs
Coherence needs to be extended across L1 and L2. L1 is on-chip, and snoop support in L1 is expensive.
Is snoop support needed in L1?
[Figure: processor (P), L1, L2, and memory (M)]
Definition: L1 is included in L2 iff all blocks in L1 are also in L2.
If inclusion is maintained, snoop support is needed only at L2 (which must be able to invalidate blocks in L1).
Consequence: a block in owned state in L1 (M in MSI) must be marked modified in L2.
Maintaining Inclusion
Violations of the inclusion property can arise from: a set-associative L1 with a history-based replacement algorithm; split I- and D-caches at L1 with a unified L2; different cache block sizes in L1 and L2.
Techniques to maintain inclusion: a direct-mapped L1, and an L2 with any associativity, given some additional constraints on block size, fetch policy, …
Note: one can always displace a block in L1 on replacement in L2 to maintain inclusion.
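The note about displacing L1 blocks can be illustrated with a toy two-level hierarchy. Everything here is an assumption for illustration: both levels are modeled as fully associative LRU sets of block addresses, and the class name `TwoLevelCache` is hypothetical. L1 hits are deliberately hidden from L2, which is what lets a history-based policy threaten inclusion.

```python
from collections import OrderedDict


class TwoLevelCache:
    def __init__(self, l1_blocks=2, l2_blocks=4):
        self.l1 = OrderedDict()  # least recently used block comes first
        self.l2 = OrderedDict()
        self.l1_blocks, self.l2_blocks = l1_blocks, l2_blocks

    def access(self, block):
        if block in self.l1:
            self.l1.move_to_end(block)  # L1 hit: L2 never sees this reference
            return
        if block in self.l2:
            self.l2.move_to_end(block)
        else:
            if len(self.l2) >= self.l2_blocks:
                victim, _ = self.l2.popitem(last=False)  # evict L2's LRU block
                self.l1.pop(victim, None)                # displace it from L1 too
            self.l2[block] = True
        if len(self.l1) >= self.l1_blocks:
            self.l1.popitem(last=False)
        self.l1[block] = True

    def snoop_invalidate(self, block):
        """Bus-side snoop: only L2 is checked; L1 is touched only on an L2 hit."""
        if block in self.l2:
            del self.l2[block]
            self.l1.pop(block, None)
            return True
        return False

    def inclusion_holds(self):
        return all(b in self.l2 for b in self.l1)
```

With the access sequence 0, 1, 0, 2, 0, 3, 0, 4, block 0 stays hot in L1 but becomes the LRU block in L2; without the `self.l1.pop(victim, None)` line it would remain in L1 after leaving L2, and a later snoop filtered at L2 would miss it.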
Split-Transaction Buses
Challenging issues: avoid conflicting requests in progress simultaneously; buffers are needed, which requires flow control; correctness issues (coherence, SC, deadlock, livelock, ...)
Separate request and response phases improve bus utilization.
[Figure: split-transaction bus timeline showing bus arbitration, Address/CMD phases, memory access delays, and Data phases]
Example of Conflict Situation
With an atomic bus, an Upgrade is committed when the bus is granted. Here, two Upgrades can be on the bus at once and may invalidate both copies.
[Figure: Cache 1 and Cache 2 both issue Upgr]
Some real examples
The details can be interesting, and they support the historical emphasis of the course. Example: SGI Power Challenge.
SGI Challenge 1(4)
High-level design decisions:
Avoid conflicts: allow a fixed number of requests, to different blocks, in progress at a time.
Flow control: buffers are limited, so NACK when full and retry.
Ordering: allow out-of-order responses (to cope with non-uniform delays).
[Figure: bus timing over time for read operation 1 and read operation 2. The bus cycles through Arb, Rslv, Addr, Dcd, Ack. Rows: address bus (address request, address, check, ack), data arbitration (data request, tag, grant), data bus (D0 to D3).]
SGI Challenge 2(4)
Separate request and response buses.
Request phase (uses the address request bus): present the address and initiate snooping; report the snoop result (prolong or NACK if necessary).
Response phase (uses the data request bus): send the data back.
[Figure: bus timing for read operation 1 and read operation 2 over the address bus, data arbitration, and data bus, with the bus cycling through Arb, Rslv, Addr, Dcd, Ack]
Design of SGI Challenge 3(4)
Max 8 outstanding requests, with a 3-bit tag to separate the requests.
A request table in each node keeps track of outstanding requests.
Writes are committed when the request is granted.
Flow control: NACK and retry when buffers are full.
[Figure: bus interface of a node, with request buffer, request table (entries 0 to 7 holding tag, address, originator, my response, and miscellaneous information), comparator, issue + merge check, snoop state from the cache, write-back buffer, data buffer, write-backs and responses, request + response queue, data to/from the cache, the Addr + cmd bus, and the Data + tag bus]
Conflict resolution: before an address request is made, the request table is checked. Memory and caches check the request independently.
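A request table with eight entries and 3-bit tags might be modeled as below. This is only a sketch: the return values `"CONFLICT"` and `"FULL"`, the lowest-free-tag allocation policy, and the class name are assumptions, and buffer-level NACKs are not modeled.

```python
MAX_OUTSTANDING = 8  # a 3-bit tag gives tags 0..7


class RequestTable:
    def __init__(self):
        self.entries = {}  # tag -> (block address, originator)

    def issue(self, block, originator):
        """Return the allocated tag, or a string saying why the request must wait."""
        if any(b == block for b, _ in self.entries.values()):
            return "CONFLICT"  # a request to this block is already in progress
        free = [t for t in range(MAX_OUTSTANDING) if t not in self.entries]
        if not free:
            return "FULL"      # eight requests outstanding: retry later
        self.entries[free[0]] = (block, originator)
        return free[0]

    def respond(self, tag):
        """A data response carrying this tag retires the matching request."""
        return self.entries.pop(tag)
```

Because responses are matched by tag rather than by position, `respond` can be called in any order, which is what allows out-of-order responses.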
Serialization and SC 4(4)
Serialization to a single location is guaranteed:
1. Only a single request to each block is allowed
2. A request is committed when it is on the bus
Problems in guaranteeing SC: it requires serialization across writes to different locations, and requests can be reordered in buffers, so being committed is not the same as being performed.
A solution: servicing incoming requests before the processor’s own requests guarantees write atomicity.
Multiple Outstanding Processor Requests
Modern processors allow multiple outstanding memory operations.
Problem: this may violate sequential consistency.
Solution: buffer all outstanding requests. Don’t make writes visible to anyone until they are committed, and don’t perform reads before previously issued requests are committed.
Lockup-free caches implement the buffering capability needed to enforce ordering of uncommitted memory operations.
Commercial Machines
[Figure (b), machine organization: R4400 CPUs and caches, interleaved memory (16 GB maximum), and an I/O subsystem (VME-64, SCSI-2, graphics, HPPI) on the Powerpath-2 bus (256 data, 40 address, 47.6 MHz)]
SGI Challenge: 36 MIPS R8000 processors with a 1.2 GB/s bus
Peak: 5.4 GFLOPS
[Figure: CPU/Mem cards (processors P with second-level caches $2 and a memory controller) and I/O cards, attached through bus interfaces to the Gigaplane(TM) bus (256 data, 41 address, 83 MHz)]
Sun Enterprise 6000: 30 UltraSparc processors with 2.67 GB/s bus
Peak: 9 GFLOPS
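As a rough check, the quoted bus bandwidths can be estimated from the bus widths and clock rates above. Two assumptions are made here and are not stated on the slide: that the Powerpath-2 data bus carries data in four of every five cycles (suggested by the D0 to D3 slots in the five-cycle Arb, Rslv, Addr, Dcd, Ack frame), and that the Gigaplane transfers 256 bits every cycle. 1 GB is taken as 10^9 bytes.

```python
def bus_bandwidth_gb_per_s(data_bits, clock_mhz, data_cycles=1, frame_cycles=1):
    """Estimated data bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_cycle = data_bits / 8
    utilization = data_cycles / frame_cycles  # fraction of cycles carrying data
    return bytes_per_cycle * clock_mhz * 1e6 * utilization / 1e9


# Assumption: 4 data cycles out of each 5-cycle bus frame.
powerpath2 = bus_bandwidth_gb_per_s(256, 47.6, data_cycles=4, frame_cycles=5)
# Assumption: data on every cycle.
gigaplane = bus_bandwidth_gb_per_s(256, 83.0)

print(round(powerpath2, 2), round(gigaplane, 2))
```

Under these assumptions the estimates come out at about 1.22 GB/s and 2.66 GB/s, close to the quoted 1.2 GB/s and 2.67 GB/s; the small Gigaplane gap would close if 83 MHz is a rounded figure.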
Look these up on the net