8/2/2019 Muge_Snoop Based Multiprocessor Design
Physical Design of Snoop-Based Cache Coherence in Multiprocessors
Muge Guher
Cache Coherence Definition
A multiprocessor memory system is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order.
Write propagation
Writes are visible to other processes
Write serialization
All writes to the same location are seen in the same order by all processes (extending this to all locations is called write atomicity)
E.g., w1 followed by w2 seen by a read from P1 will be seen in the same order by all reads by other processors Pi
Cache Coherence
Snooping Shared memory multiprocessor environment
Main Memory is passive
Caches distribute state transitions to other caches and memory
All caches listen to snoop messages and act on them
Most machines use cache coherence protocols with different trade-offs
But performance (latency and bandwidth) also depends on the physical implementation:
Bus design
Cache design
Integration with memory
Cache Coherence Requirements
Protocol Algorithm
States
State transitions
Actions/Outputs
Physical Design
Protocol intent is implemented in FSMs
Cache controller FSM: multiple states per miss
Bus controller FSM
Other Controllers
Support for:
Multiple Bus transactions
Multi-Level Caches
Split-Transaction Busses
[Figure: MSI state-transition diagram with states M, S, and I. Edge labels (event/action): PrRd/-, PrWr/-, PrWr/BusRdX, PrRd/BusRd, BusRd/-, BusRd/Flush, BusRdX/-, BusRdX/Flush.]
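The state diagram on this slide can be written down as a transition table. The following is a minimal sketch, assuming the standard three-state MSI invalidation protocol that the edge labels suggest; the names `MSI` and `step` are illustrative and do not come from the slides.

```python
# (current state, observed event) -> (next state, bus action or None)
# Processor events: PrRd, PrWr. Snooped bus events: BusRd, BusRdX.
MSI = {
    ("I", "PrRd"):   ("S", "BusRd"),
    ("I", "PrWr"):   ("M", "BusRdX"),
    ("S", "PrRd"):   ("S", None),
    ("S", "PrWr"):   ("M", "BusRdX"),
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "PrRd"):   ("M", None),
    ("M", "PrWr"):   ("M", None),
    ("M", "BusRd"):  ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
}


def step(state, event):
    """Return (next_state, bus_action); unlisted pairs leave the state unchanged."""
    return MSI.get((state, event), (state, None))
```

For example, `step("M", "BusRd")` yields `("S", "Flush")`: the owner supplies the dirty block and keeps a shared copy.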
Design Wish List
Implementation should be
Correct
Require minimal extra hardware
Offer high performance
High performance can be achieved with multiple events in progress, overlapping latencies
Leads to numerous complex interactions between events
More bugs!
Design Issues with Implementing Snooping
Cache controller and tags
Bus side and processor side interactions
Reporting snoop results: how and when
Handling write-backs
Non-atomic state transitions
Overall set of actions for memory operation are not atomic
Race conditions
Atomic operations
Deadlock, livelock, starvation, serialization.
Cache Controller and Tags
Cache controller
Must monitor bus operations and respond to processor operations
Two controllers: bus-side, and processor-side
Bus transactions: the bus-side controller captures the address and performs a tag check.
Fail: snoop miss, no action
Hit: cache coherence protocol, RMW on state bits
For single-level caches, duplicate set of tags and state, or dual-ported tag and state store
Controller is an initiator and responder to bus transactions.
[Figure: Single-level snoopy cache organization [1]. One set of tags is used by the bus snooper and one by the processor, in front of a single copy of the cached data. Data is not duplicated. Both sets of tags may be updated simultaneously.]
Reporting Snoop Results
How does memory know another cache will respond and provide a copy of the block so it doesn't have to?
Uniprocessor
Initiator places an address on the bus
Responder must acknowledge within a time-out window (wired-OR), otherwise bus error.
Snooping Caches
All caches must report on the bus before transaction can proceed.
Snoop result informs main memory whether it should respond or whether a cache has a modified copy of the block.
When and how is the snoop result reported on the bus?
For example, to implement the MESI protocol:
Memory needs to know: Is the block dirty? Should it respond or not?
Requesting cache needs to know: Is the block shared?
When to Report Snoop Results
Within a fixed number of clock cycles from the address issue on the bus
Dual set of tags, high-priority processor access to the tags.
Both sets are inaccessible during processor updates.
Extra HW & longer snoop latency, but simple memory subsystem
Pentium Pro, HP Servers, Sun enterprise.
After a variable delay
Memory assumes one of the caches will supply the data until all have snooped and indicated results.
Easier to implement; tolerates tag-access conflicts
Higher performance; don't have to assume worst-case delay
SGI Challenge: memory fetches the data and stalls until the snoop completes
Immediately
Main memory maintains a state bit per block, indicating whether the block is modified in a cache.
Complexity introduced to main memory subsystem
How to Report Snoop Results
Three wired-OR signals:
1, 2: Two for snoop results
Shared: asserted if any cache has a copy
Dirty: asserted if some cache has a dirty copy
Dirty cache knows what action to take
3: One indicating snoop valid. Inhibit signal, asserted until all processors have completed their snoop.
Illinois MESI protocol allows cache-to-cache transfers.
Retrieve data from other caches rather than memory.
Priority scheme needed
SGI Challenge, Sun Enterprise Server: only in exclusive or modified state.
Challenge updates memory during cache-to-cache transfer (no shared-modified state)
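The three wired-OR lines can be modelled as a reduction over what each cache drives. This is a rough sketch under assumptions: `SnoopOutput` and `resolve` are hypothetical names, the requester's resulting state follows the usual MESI read-miss rule (E if nobody else has a copy, otherwise S), and a dirty owner is assumed to supply the block instead of memory.

```python
from dataclasses import dataclass


@dataclass
class SnoopOutput:
    done: bool    # this cache has finished its tag lookup
    shared: bool  # this cache holds a copy of the block
    dirty: bool   # this cache holds a modified copy


def resolve(outputs):
    """Combine per-cache snoop outputs the way wired-OR bus lines would."""
    # Inhibit stays asserted until every cache has completed its snoop.
    inhibit = any(not o.done for o in outputs)
    if inhibit:
        return {"valid": False}
    shared = any(o.shared for o in outputs)
    dirty = any(o.dirty for o in outputs)
    return {
        "valid": True,
        "shared": shared,
        "dirty": dirty,
        # Assumption: memory stays silent when some cache is dirty.
        "supplier": "cache" if dirty else "memory",
        # Assumption: MESI read miss loads E only if no other copy exists.
        "requester_state": "S" if (shared or dirty) else "E",
    }
```

With a variable-delay snoop, memory would wait on the inhibit line (here `valid` being false) before committing to respond.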
Single Level Snooping Cache
[Figure: Snooping cache design [1]. Block diagram showing the processor (P), the processor-side controller with its tags and state, the bus-side controller with tags and state for snooping, two comparators, the cache data RAM, the write-back buffer, the data buffer, and the Addr, Cmd, and Data connections to the system bus.]
Assumptions
- Single-level write-back cache
- Invalidation protocol
- Processor can have one memory request outstanding
- The system bus is atomic
Multi-level Cache Hierarchies
How would the design of a cache controller be modified in the case of L1/L2 caches?
Complicates Coherence
Changes made by the processor to L1 cache may not be visible to the L2 cache controller, which is responsible for bus operations
Bus transactions are not directly visible to L1 cache
A Solution:
Independent bus snooping HW for each level of the cache hierarchy
L1 cache is usually on the processor chip; an on-chip snooper consumes pins to monitor the shared bus
Duplicating tags consumes too much on-chip area
Duplication of effort between L1 and L2 snoops.
Intel's 8870 chipset has a snoop filter for quad-core
How do you guarantee coherence in a multi-level cache hierarchy?
Better Solution: Based on Inclusion Property
1. If a memory block is in L1 cache it must also be in L2 cache
2. If the block is in modified state (or shared-modified) in L1 cache, then its copy in L2 must also be marked modified
Therefore:
only a snooper at L2 is necessary, as it has all the required information
If a BusRd requests a block that is in modified state in either cache, then L2 can waive the memory access and inform L1.
Now information flows both ways:
L1 accesses L2 for cache miss handling and block state changes;
L2 forwards to L1 blocks invalidated/updated by bus transactions;
Inclusion Property
Difficulties with maintaining the inclusion property:
L1 and L2 may have different eviction algorithms (replacement differences)
While a block is kept by L1 it may be evicted by L2
Separated data and instruction caches.
Different cache block sizes.
In the most commonly encountered case, inclusion is maintained automatically:
L1 is direct mapped
L2 is either direct mapped or set associative
Same block size for both caches
Number of sets in L1 is smaller than in L2
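A small simulation can make both halves of this slide concrete. This is a sketch under assumptions not stated on the slide: caches track block numbers only, use modulo set indexing with LRU replacement, L2 sees only L1 misses, and the L1 set count divides the L2 set count. The names `Cache` and `run` are illustrative.

```python
from collections import OrderedDict


class Cache:
    """Set-associative cache of block numbers with LRU replacement."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def contains(self, block):
        return block in self.sets[block % self.num_sets]

    def touch(self, block):
        """Access a block, inserting it on a miss and evicting the LRU block if full."""
        s = self.sets[block % self.num_sets]
        if block in s:
            s.move_to_end(block)      # refresh LRU position on a hit
            return
        if len(s) == self.ways:
            s.popitem(last=False)     # evict the least recently used block
        s[block] = True

    def blocks(self):
        return {b for s in self.sets for b in s}


def run(l1, l2, trace):
    """Return True if L1's contents stay a subset of L2's after every access."""
    for block in trace:
        hit_l1 = l1.contains(block)
        l1.touch(block)
        if not hit_l1:
            l2.touch(block)           # L2 only observes L1 misses
        if not l1.blocks() <= l2.blocks():
            return False
    return True
```

With a direct-mapped L1 of 4 sets and a 2-way L2 of 8 sets, `run` reports that inclusion holds for any trace. With a 2-way L1 and a 3-way L2 sharing one set, the trace `[1, 2, 1, 3, 1, 4]` breaks inclusion: L1 hits keep block 1 recent in L1 while L2 never sees them and evicts it.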
Explicitly Maintaining Inclusion
Extend the mechanisms used for propagating coherence events to the cache hierarchy.
Propagate L2 replacements to L1
Invalidate or flush messages
Propagate bus transactions from L2 to L1
Send all transactions to L1 (even if the given block is not present there)
Add extra state to L2 (a bit per block) indicating which blocks in L2 are also in L1 (inclusion bit)
On write: propagate modified state from L1 to L2. If L1 is:
Write-through (so all modifications also affect L2), invalidate
Write-back: add a state bit to every block in L2, "modified-but-stale"
Request flush from L1 on bus read
L2 serves as a filter for the L1 cache, screening out irrelevant transactions from the bus, i.e. dual tags are less critical with multilevel caches
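The filtering role of L2 and the inclusion bit can be sketched as follows. This is a hedged illustration only: the dictionary layout, the `in_l1` field, and the function name `snoop_invalidate` are assumptions made for the example, and only an invalidating bus transaction is modelled.

```python
def snoop_invalidate(l2_lines, l1_blocks, block):
    """Handle an invalidating bus transaction at L2.

    l2_lines maps block -> {"state": str, "in_l1": bool} (hypothetical layout).
    l1_blocks is the set of blocks currently held in L1.
    Returns what happened, so the filtering effect is visible.
    """
    line = l2_lines.get(block)
    if line is None:
        # By inclusion, L1 cannot hold the block either: nothing to forward.
        return "filtered"
    forwarded = line["in_l1"]
    if forwarded:
        l1_blocks.discard(block)   # propagate the invalidation up to L1
    del l2_lines[block]
    return "forwarded" if forwarded else "l2-only"
```

Only the `"forwarded"` case disturbs L1, which is why the L1 tags see far fewer snoops than a single-level design would.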
Propagating Transactions for Coherence in Hierarchy
How is the transaction propagated for multilevel caches? Show some examples of modern systems.
Only one transaction on the bus at a time.
Transactions are propagated up and down the hierarchy; bus transactions may be held until propagation completes.
The performance penalty for holding a processor write until BusRdX has been granted is high, which motivates decoupling these operations
[Figure: Two-level snoopy cache organization [1]. The L1 cache has tags used mainly by the processor; the L2 cache has tags used mainly by the bus snooper. Each level holds its own tags and cached data.]
Split Transaction Bus
In a split-transaction bus (STB), transactions that require a response are split into two independent sub-transactions: a request transaction and a response transaction.
Arbitrate each phase separately
Other transactions are allowed to intervene between request & response
Buffering between bus and the cache controllers allows multiple outstanding transactions (waiting for snoop and/or data responses)
Pro: By pipelining bus operations the bus is utilized more efficiently.
Con: Increased complexity.
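[Figure: Split-transaction bus timing. Address/CMD phases and Data phases of different transactions interleave on the bus, separated by bus arbitration and overlapping each transaction's memory access delay.]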
Issues Supporting STBs
A new request can appear on the bus before the snoop and/or servicing of an earlier request are complete;
these requests may be conflicting requests (same block);
The number of buffers for incoming requests and potential data responses from bus to cache controller is usually fixed and small, so flow control is needed
Since requests from the bus are buffered, when and how are snoop and data responses produced on the bus?
In the same order as requests arrive?
Snoop and data response together or separately?
Examples: separately on Sun, together on SGI
There are 3 phases in a transaction:
1. A request is put on the bus
2. Snoop results are sent by other caches
3. Data is sent for the requesting cache, if needed
SGI Challenge Example
Features:
Does not allow conflicting requests for same block (8 outstanding requests)
NACK Flow-control
NACK as soon as request appears on bus, requestor retries
Separate command (incl. NACK) + address and tag + data buses
Responses may be in different order than requests
Order of transactions determined by requests
Snoop results presented on bus with response
Examine implementation specifics of:
Bus design, request response matching
Snoop results
Flow Control
Bus Design
Two independently arbitrated buses:
Request: command+address (BusRd, BusWB + target address)
Response: data
Match each response to outstanding request, since they arrive out of order
Tag request with 3-bits (8 outstanding) when launched
Tag arrives back with corresponding response
Address bus is free, as tag is sufficient for request matching
Address and data buses can be arbitrated separately
Separate bus lines for arbitration, flow control and snoop results
Bus and Cache Controller Design
Cache Controller
To keep track of outstanding requests on the bus, each cache controller maintains an eight-entry buffer, the request table
A new request on the bus is added to all request tables at the same index
Index is the 3-bit tag assigned at arbitration
Table entry contains: block address, request type, state in that cache, etc.
Table is fully associative; a new entry can be placed anywhere in the table
Checked for a match by the requesting processor and by all snooped requests and responses on the bus
Entry and tag are freed when the response is observed on the bus
Now the tag can be reassigned by the bus
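The request table's bookkeeping can be sketched as a small class. This is a simplified model under assumptions: it represents one controller's copy of the table, picks the lowest free tag (the real tag is handed out by bus arbitration), and refuses a request when the block already has an outstanding entry. `RequestTable`, `issue`, and `respond` are hypothetical names.

```python
class RequestTable:
    """One controller's view of the outstanding bus requests (eight 3-bit tags)."""

    SIZE = 8

    def __init__(self):
        self.entries = [None] * self.SIZE   # tag -> (block_addr, req_type)

    def issue(self, block_addr, req_type):
        """Return the tag for a new request, or None if it must wait."""
        # Conflict check: no second request for a block with a pending response.
        if any(e is not None and e[0] == block_addr for e in self.entries):
            return None
        for tag, entry in enumerate(self.entries):
            if entry is None:
                self.entries[tag] = (block_addr, req_type)
                return tag
        return None                          # all eight tags are in use

    def respond(self, tag):
        """Match a response by its tag and free the entry for reuse."""
        entry = self.entries[tag]
        self.entries[tag] = None
        return entry
```

Because the response carries only the tag, the address bus is not needed during the data phase; `respond(tag)` recovers the block address from the table.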
Bus Interface and Request Table
[Figure: Bus interface logic to accommodate split-transaction bus [1]. Block diagram showing the request buffer, the eight-entry request table (entries 0 to 7 holding tag, address, originator, my-response flag, and miscellaneous information), comparators, the request and response queues, the write-back buffer, the data buffer, snoop state from the cache, issue and merge check logic, and connections to the Addr + cmd bus and the Data + tag bus.]
Snoop Results & Request Conflicts
Variable delay snooping
Snoop portion of the bus consists of three wired-OR lines
Sharing, dirty, inhibit
Request phase determines who will respond, but may take many cycles and intervening request/response transactions
All controllers present their snoop results on bus when they see response
No data response or snoop results for write backs and upgrades
Avoid conflicts by:
Every controller keeps record of pending reads in request table
Don't issue request for a block with outstanding response
Writes performed during request phase
However, this does not ensure sequential consistency!
Flow Control
Implement flow control at:
incoming request buffers from bus to cache controller (write-back buffer)
Cache subsystem has a response buffer (address + cache block of data)
limit number of outstanding requests
Flow control is also needed at main memory,
Each of the 8 pending requests can generate a write-back to memory
Can happen in quick succession on bus
SGI Challenge: separate NACK lines for address and data buses
Asserted before ack phase of request (response) cycle is done
Request (response) cancelled everywhere, and retried later
Backoff and priorities to reduce traffic and starvation
SUN Enterprise: destination initiates retry when it has a free buffer
source keeps watch for this retry
guaranteed space will still be there, so only two tries needed at most
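NACK-based flow control of the SGI Challenge kind can be sketched with a bounded buffer. The sketch below rests on assumptions: the buffer capacity is arbitrary, the receiver services exactly one entry between retries (a modelling shortcut standing in for elapsed time), and backoff and priorities are omitted. `IncomingBuffer` and `send_with_retry` are hypothetical names.

```python
from collections import deque


class IncomingBuffer:
    """Fixed-size buffer at a cache controller or at main memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def offer(self, request):
        """Accept the request if there is room, otherwise NACK it."""
        if len(self.items) >= self.capacity:
            return "NACK"
        self.items.append(request)
        return "ACK"

    def service_one(self):
        """Free one slot by servicing the oldest buffered request."""
        return self.items.popleft() if self.items else None


def send_with_retry(buffer, request, max_attempts):
    """Retry a NACKed request; return the attempt that succeeded, or None."""
    for attempt in range(1, max_attempts + 1):
        if buffer.offer(request) == "ACK":
            return attempt
        buffer.service_one()   # assumption: receiver makes progress between tries
    return None
```

The Sun Enterprise scheme on this slide differs in who initiates the retry: the destination does so once it has reserved a buffer, which is why at most two tries are needed there.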
Preventing violation of Sequential Consistency
SC: Serialization of operations to different locations.
Multiple outstanding requests on the bus; invalidations are buffered between bus and cache and are not applied to the cache immediately
Commitment versus completion
Value produced by a write commit may not be visible to other processors
Condition necessary for SC: a processor should not be allowed to actually see the new value of a write before previous writes (in bus order) are visible to it.
1. Do not let certain types of incoming transactions from bus to cache be reordered in the incoming queues
2. Allow these re-orderings in the queues, but then ensure that the important orders are preserved at the necessary points in the machine.
3. A simpler approach is to treat all the requests in FIFO order. Although this approach is simpler, it can have performance problems.
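The FIFO option (3) can be illustrated with a cache whose incoming bus transactions are applied strictly in arrival order. This is a toy sketch under assumptions: only invalidations and data replies are modelled, values are stored per block, and the class and method names are invented for the example.

```python
from collections import deque


class CacheWithIncomingQueue:
    """Cache whose incoming bus transactions are applied strictly in FIFO order."""

    def __init__(self):
        self.lines = {}          # block -> value; presence means a valid copy
        self.incoming = deque()  # transactions in bus order, not yet applied

    def observe(self, kind, block, value=None):
        """Buffer a transaction seen on the bus ('inval' or 'data')."""
        self.incoming.append((kind, block, value))

    def apply_next(self):
        kind, block, value = self.incoming.popleft()
        if kind == "inval":
            self.lines.pop(block, None)
        elif kind == "data":
            self.lines[block] = value
        return kind, block

    def apply_until_data(self, block):
        """Drain the queue in order until the data reply for `block` is applied."""
        while self.incoming:
            if self.apply_next() == ("data", block):
                return
```

If an invalidation for A precedes the data reply for B in bus order, the processor cannot read the new B while still holding the stale A, because the invalidation is applied first.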
Multi-level Caches and STB
A request takes a considerable number of cycles to propagate through the cache hierarchy
Allow other transactions to move up and down the hierarchy while waiting
To maintain high bandwidth while allowing the individual units (controllers and caches) to operate at their own rates, queues are placed between levels of the hierarchy.
Leads to deadlock and serialization issues
[Figure: Multi-level caches with a split-transaction bus. Each processor sits above its L1$ and L2$; eight numbered queues carry processor requests and responses, requests/responses from L1 to L2 and from L2 to L1, and requests/responses to and from the bus.]
Deadlock
Fetch deadlock:
Must buffer incoming requests/responses while a request is outstanding
One outstanding request per processor => need space to hold p requests plus one reply (latter is essential)
If smaller (or if multiple outstanding requests), may need to NACK
Then need priority mechanism in bus arbiter to ensure progress
Buffer deadlock:
L1 to L2 queue filled with read requests, waiting for response from L2
L2 to L1 queue filled with bus requests waiting for response from L1
The latter condition arises only when a cache closer to the processor than the lowest level is write-back
Could provide enough buffering, requires a lot of area, not scalable
Queues may need to support bypassing
Sequential Consistency
Separation of commitment from completion is even greater with multi-level caches
Do not wait for an invalidation to reach all the way up to L1 and return a reply; consider a write committed when placed on the bus
Fortunately, techniques for single-level cache and ST bus extend; either method works:
don't allow certain re-orderings of transactions at any level
don't let an outgoing operation proceed past a level before incoming invalidations/updates at that level are applied
Shared Cache Designs
Are there any solutions of shared L2 caches that are based on a bus network? How does the bus network need to be modified to support shared caches?
Benefits of sharing a cache:
Eliminates the need for cache-coherence at this level
If L1 cache is shared then there are no multiple copies of a cache block and hence no coherence problem
Reduces the latency of communication.
L1 communication latency is 2-10 clocks; main-memory latency is many times larger
Reduced latency enables finer-grained sharing of data
Pre-fetching data across processors.
With private caches each processor incurs the miss penalty separately
Reduces the BW requirements at the next level of the hierarchy.
More effective use of long cache blocks, as there is no false sharing
Shared cache is smaller than the combined size of the private caches if working sets from different processors overlap
Shared Cache Designs
Extreme case:
All processors share an L1 cache; below it is a shared memory
Processors are connected to the shared cache by a switch, most likely a crossbar, to allow cache access by processors in parallel
Support high BW by interleaving cache and main memory
Disadvantages of sharing L1:
higher bandwidth demand
hit latency to a shared cache is higher than to a private one
higher cache complexity
shared caches are usually slower
instead of constructive interference (like the working set example), destructive interference can occur
Example of Shared Cache Designs
Alliant FX-8 machine (1980s):
8 custom processors
Clock cycle 170ns
Processors connected using crossbar to 512 Kbyte, 4-way interleaved cache
Cache: 32-byte block size, direct mapped, write-back, 2 outstanding misses per processor
Encore Multimax (contemporary)
Snoopy cache coherent multiprocessor
Each private cache supports two processors instead of one
Practical approach:
private L1 caches and a shared L2 cache among groups of processors.
packaging considerations are also important
References
[1] David Culler, Jaswinder Pal Singh, and Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, preliminary draft edition (August 1997), pp. 355-417
[2] Daniel Braga de Faria, Stanford, Book Summaries, retrieved October 2010, from http://www-cs-students.stanford.edu/~dbfaria/
[3] Andy Pimentel, Introduction to Parallel Architecture, retrieved October 2010, from http://staff.science.uva.nl/~andy/aci/syl.pdf, pp. 46-52
[4] R. H. Katz, S. J. Eggers, D. A. Wood, C. L. Perkins, and R. G. Sheldon, Implementing a Cache Consistency Protocol, Proceedings of the 12th ISCA, 1985, pp. 276-283
[5] M. S. Papamarcos and J. H. Patel, A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories, Proceedings of the 11th ISCA, 1984, pp. 348-354
[6] R. Kumar, V. Zyuban, and D. M. Tullsen, Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling, ISCA, June 2005.