Cache Coherence Definition
A multiprocessor memory system is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order.
Write propagation
Writes are visible to other processors
Write serialization
All writes to the same location are seen in the same order by all processors (extending this to all locations is called write atomicity)
E.g., if w1 followed by w2 is seen in that order by a read from P1, the same order will be seen by reads on all other processors Pi
Cache Coherence: Snooping
Shared memory multiprocessor environment
Main memory is passive
Caches distribute state transitions to other caches and memory
All caches listen to snoop messages and act on them
Most machines use cache coherence protocols with different trade-offs
But, performance (latency and bandwidth) also depends on physical implementation.
Bus design Cache design Integration with memory
Cache Coherence Requirements
Protocol Algorithm
States
State transitions
Actions/Outputs
Physical Design
Protocol intent is implemented in FSMs
Cache controller FSM: multiple states per miss
Bus controller FSM and other controllers
Support for: multiple bus transactions, multi-level caches, split-transaction buses
[Figure: MSI state-transition diagram. States M, S, I. Processor-side edges: PrRd/BusRd and PrWr/BusRdX from I; PrRd/— and PrWr/BusRdX from S; PrRd/— and PrWr/— in M. Bus-side edges: BusRd/Flush and BusRdX/Flush from M; BusRd/— and BusRdX/— from S.]
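As a rough illustration of how this protocol intent maps onto controller FSMs, here is a minimal C sketch of the MSI transitions in the diagram above; the type and function names are illustrative and not taken from any particular machine.

/* Minimal sketch of the MSI cache-controller FSM shown above.
 * Names (cache_state_t, handle_processor_op, handle_snoop) are illustrative. */
typedef enum { INVALID, SHARED, MODIFIED } cache_state_t;
typedef enum { PR_RD, PR_WR } proc_op_t;
typedef enum { BUS_RD, BUS_RDX } bus_op_t;
typedef enum { BUS_NONE, BUS_ISSUE_RD, BUS_ISSUE_RDX, BUS_FLUSH } bus_action_t;

/* Processor-side transitions: PrRd/PrWr may generate a bus transaction. */
bus_action_t handle_processor_op(cache_state_t *s, proc_op_t op) {
    switch (*s) {
    case INVALID:
        if (op == PR_RD) { *s = SHARED;   return BUS_ISSUE_RD;  }  /* PrRd/BusRd    */
        else             { *s = MODIFIED; return BUS_ISSUE_RDX; }  /* PrWr/BusRdX   */
    case SHARED:
        if (op == PR_RD) {                return BUS_NONE;      }  /* PrRd/--       */
        else             { *s = MODIFIED; return BUS_ISSUE_RDX; }  /* PrWr/BusRdX   */
    case MODIFIED:
        return BUS_NONE;                                           /* PrRd, PrWr/-- */
    }
    return BUS_NONE;
}

/* Bus-side (snoop) transitions: observed transactions may force a flush. */
bus_action_t handle_snoop(cache_state_t *s, bus_op_t op) {
    switch (*s) {
    case MODIFIED:
        *s = (op == BUS_RD) ? SHARED : INVALID;
        return BUS_FLUSH;                          /* BusRd/Flush, BusRdX/Flush */
    case SHARED:
        if (op == BUS_RDX) *s = INVALID;           /* BusRdX/--                 */
        return BUS_NONE;
    case INVALID:
        return BUS_NONE;
    }
    return BUS_NONE;
}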
Design Wish List
Implementation should be
Correct
Require minimal extra hardware
Offer high performance
High performance can be achieved with multiple events in progress and overlapping latencies
This leads to numerous complex interactions between events, and more bugs!
Design Issues with implementing Snooping
Cache controller and tags
Bus side and processor side interactions
Reporting snoop results: how and when
Handling write-backs
Non-atomic state transitions
Overall set of actions for memory operation are not atomic
Race conditions
Atomic operations
Deadlock, livelock, starvation, serialization.
Cache Controller and Tags
Cache controller
Must “monitor” bus operations and “respond” to processor operations
Two controllers: bus-side, and processor-side
Bus transactions: the bus-side controller captures the address and performs a tag check.
Miss (snoop miss): no action. Hit: the cache coherence protocol applies, requiring a read-modify-write (RMW) of the state bits
For single level caches, duplicate set of tags and state or dual-ported tag and state store
Controller is an initiator and responder to bus transactions.
[Figure: Single-level snoopy cache organization [1]. Two sets of tags: one used by the bus snooper and one used by the processor. The cached data is not duplicated; both sets of tags may be updated simultaneously.]
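To make the dual-tag organization concrete, the following is a minimal C sketch of a cache with duplicated tag/state arrays (one for the processor, one for the snooper) and a single data array; the sizes, field names, and functions are illustrative assumptions.

/* Sketch of the dual-tag organization: the processor and the bus snooper
 * each probe their own copy of the tags/state, while the data array exists
 * only once. All sizes and names here are illustrative. */
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS   256
#define BLOCK_BITS 5                      /* 32-byte blocks */

typedef struct {
    uint32_t tag;
    uint8_t  state;                       /* e.g., MSI state */
    bool     valid;
} tag_entry_t;

typedef struct {
    tag_entry_t proc_tags[NUM_SETS];      /* copy probed by the processor   */
    tag_entry_t snoop_tags[NUM_SETS];     /* copy probed by the bus snooper */
    uint8_t     data[NUM_SETS][1 << BLOCK_BITS];  /* single copy of data    */
} cache_t;

/* Bus-side snoop: tag check against the snooper's copy only.
 * A miss ("snoop miss") needs no action; a hit triggers the protocol's
 * read-modify-write of the state bits, which must then be mirrored into
 * proc_tags as well. */
bool snoop_hit(const cache_t *c, uint32_t addr) {
    uint32_t set = (addr >> BLOCK_BITS) % NUM_SETS;
    uint32_t tag = addr >> BLOCK_BITS;    /* simplified: full block number as tag */
    const tag_entry_t *e = &c->snoop_tags[set];
    return e->valid && e->tag == tag;
}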
Reporting Snoop Results
How does memory know whether another cache will respond and provide a copy of the block, so that it doesn't have to?
Uniprocessor
Initiator places an address on the bus
Responder must acknowledge within a time-out window (wired-OR), otherwise bus error.
Snooping Caches
All caches must report on the bus before transaction can proceed.
Snoop result informs main memory, if it should respond or a cache has a modified copy of the block.
When and how is the snoop result reported on the bus?
For example, to implement the MESI protocol:
Memory needs to know: is the block dirty? Should it respond or not?
The requesting cache needs to know: is the block shared?
When to Report Snoop Results
Within a fixed number of clock cycles from the address issue on the bus
Dual set of tags, high priority processor access to the tags.
Both sets are inaccessible while the processor updates the tags
Extra hardware and longer snoop latency, but a simple memory subsystem
Pentium Pro, HP Servers, Sun enterprise.
After a variable delay
Memory assumes one of the caches will supply the data, until all have snooped and indicated results.
Easier to implement despite tag access conflicts
Higher performance, since the worst-case delay does not have to be assumed
SGI Challenge: memory fetches the data but stalls until the snoop is complete
Immediately
Main memory maintains a state bit per block indicating whether it is modified in some cache
Complexity introduced to main memory subsystem
How to Report Snoop Results
Three wired-OR signals
1, 2: Two for the snoop results
Shared: asserted if any cache has a copy
Dirty: asserted if some cache has a dirty copy
The cache holding the dirty copy knows what action to take
3: One indicating that the snoop result is valid: an inhibit signal, asserted until all processors have completed their snoop
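A minimal C sketch of how these three wired-OR lines can be modeled is shown below; the structures and names are illustrative assumptions, not any machine's actual snoop logic.

/* Each cache contributes its own snoop result; the bus sees the OR of all. */
#include <stdbool.h>

typedef struct {
    bool done;      /* this cache has finished its snoop        */
    bool shared;    /* it holds a copy of the block             */
    bool dirty;     /* it holds the block in modified state     */
} snoop_result_t;

typedef struct {
    bool shared;    /* wired-OR: some cache has a copy           */
    bool dirty;     /* wired-OR: some cache has a modified copy  */
    bool inhibit;   /* held asserted until every cache has snooped */
} snoop_lines_t;

snoop_lines_t aggregate_snoop(const snoop_result_t *r, int num_caches) {
    snoop_lines_t lines = { false, false, false };
    for (int i = 0; i < num_caches; i++) {
        if (!r[i].done) { lines.inhibit = true; continue; }
        lines.shared |= r[i].shared;
        lines.dirty  |= r[i].dirty;
    }
    /* Memory responds only once inhibit is deasserted and dirty is low;
     * if dirty is asserted, the owning cache supplies (or flushes) the block. */
    return lines;
}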
Illinois MESI protocol allows cache-to-cache transfers.
Retrieve data from other caches rather than memory.
Priority scheme needed
SGI Challenge and Sun Enterprise Server: cache-to-cache transfer only from a block in exclusive or modified state
The Challenge updates memory during the cache-to-cache transfer (no shared-modified state)
Single Level Snooping Cache
[Figure: processor-side and bus-side controllers sharing a single cache data RAM; tags and state are duplicated (one copy for the processor, one for the snoop), each with its own comparator; a write-back buffer and data buffer connect to the system bus via Addr/Cmd and Data lines.]
Assumptions:
- Single-level write-back cache
- Invalidation protocol
- Processor can have one memory request outstanding
- The system bus is atomic
Snooping cache design [1]
Multi-level Cache Hierarchies
How would the design of a cache controller be modified for L1/L2 caches?
Complicates Coherence
Changes made by the processor to L1 cache may not be visible to L2 cache controller, which is responsible for bus operations
Bus transactions are not directly visible to L1 cache
A Solution:
Independent bus-snooping hardware for each level of the cache hierarchy
The L1 cache is usually on the processor chip; an on-chip snooper consumes pins to monitor the shared bus
Duplicating tags consumes too much on-chip area
Duplication of effort between the L1 and L2 snoops
Intel’s 8870 chipset has a “snoop filter” for quad-core
How do you guarantee coherence in a multi-level cache hierarchy?
Better Solution: Based on “Inclusion Property”
1. If memory block is in L1 cache it must also be in L2 cache
2. If a block is in modified state (or shared-modified) in the L1 cache, then its copy in the L2 cache must also be marked modified
Therefore, only a snooper at L2 is necessary, as it has all the required information. If a BusRd requests a block that is in modified state in either cache, then L2 can waive the memory access and inform L1.
Now information flows both ways: L1 accesses L2 for cache-miss handling and block state changes; L2 forwards to L1 the blocks invalidated or updated by bus transactions
Inclusion Property
Difficulties with maintaining the inclusion property:
L1 and L2 may have different eviction algorithms (replacement differences)
While a block is kept by L1 it may be evicted by L2
Separated data and instruction caches.
Different cache block sizes.
In the most commonly encountered case, inclusion works automatically (see the address sketch after this list):
L1 is direct mapped
L2 is either direct mapped or set associative
Same block size for both caches
Number of sets in L1 is smaller than in L2
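The small C sketch below works one address through such a configuration; the block size and set counts are illustrative assumptions. Because both set counts are powers of two and the block sizes match, the L1 set index is simply the low-order bits of the L2 set index, so any block that L2 invalidates or flushes can be located directly in L1.

/* Worked example for the "automatic inclusion" configuration above.
 * Illustrative sizes: 32-byte blocks, 64-set direct-mapped L1, 1024-set L2. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 5        /* same block size in L1 and L2 */
#define L1_SETS    64       /* L1 direct mapped             */
#define L2_SETS    1024     /* L2 has more sets             */

int main(void) {
    uint32_t addr   = 0x12345678;
    uint32_t l1_set = (addr >> BLOCK_BITS) % L1_SETS;
    uint32_t l2_set = (addr >> BLOCK_BITS) % L2_SETS;
    /* Since L1_SETS divides L2_SETS, l1_set == l2_set % L1_SETS:
     * the L1 set is fully determined by the L2 set. */
    printf("L1 set %u, L2 set %u (L2 set mod L1_SETS = %u)\n",
           (unsigned)l1_set, (unsigned)l2_set, (unsigned)(l2_set % L1_SETS));
    return 0;
}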
Explicitly Maintaining Inclusion
Extend the mechanisms used for propagating coherence events to the cache hierarchy.
Propagate L2 replacements to L1
Invalidate or flush messages
Propagate bus transactions from L2 to L1
Send all transactions to L1 (even if the given block is not present there)
Add extra state to L2 (a bit per block) indicating which blocks in L2 are also in L1 (inclusion bit)
On a write: propagate the modified state from L1 to L2. If L1 is:
Write-through: all modifications also reach L2, so the state propagates automatically
Write-back:
Add a per-block state bit in L2, "modified-but-stale"
On a bus read, L2 requests a flush from L1
L2 serves as a filter for the L1 cache, screening out irrelevant transactions from the bus, i.e. dual tags are less critical with multilevel caches
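The following is a minimal C sketch of the L2 bookkeeping just described, assuming a write-back L1: an inclusion bit and a "modified-but-stale" bit per L2 block. The structure and the callback names are illustrative assumptions.

/* Per-block L2 state used to maintain inclusion over a write-back L1. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;
    bool     valid;
    bool     in_l1;               /* inclusion bit: block also present in L1 */
    bool     modified_but_stale;  /* L1 holds a newer (dirty) copy than L2   */
} l2_block_t;

/* On an observed BusRd hitting this L2 block: if L1 holds the only
 * up-to-date copy, ask L1 to flush it before L2 (not memory) responds. */
void l2_handle_bus_read(l2_block_t *b,
                        void (*request_l1_flush)(uint32_t tag)) {
    if (b->valid && b->modified_but_stale && b->in_l1) {
        request_l1_flush(b->tag);        /* propagate the transaction up to L1 */
        b->modified_but_stale = false;   /* L2 copy is now current             */
    }
}

/* On an L2 replacement: inclusion requires evicting the block from L1 too. */
void l2_handle_replacement(l2_block_t *b,
                           void (*invalidate_l1)(uint32_t tag)) {
    if (b->valid && b->in_l1)
        invalidate_l1(b->tag);           /* flush if dirty, then invalidate    */
    b->valid = b->in_l1 = b->modified_but_stale = false;
}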
Propagating transactions for Coherence in Hierarchy
How is the transaction propagated for multilevel caches? Show some examples of modern systems.
Only one transaction on the bus at a time.
Transactions are propagated up and down the hierarchy, bus transactions may be held until propagation completes.
The performance penalty for holding a processor write until the BusRdX has been granted is high, so there is motivation to de-couple these operations
[Figure: Two-level snoopy cache organization [1]. L1 cache with tags used mainly by the processor; L2 cache with tags used mainly by the bus snooper; contrast with the duplicated tags of the single-level organization.]
Split Transaction Bus
In a split-transaction bus (STB), transactions that require a response are split into two independent sub-transactions: a request transaction and a response transaction.
Arbitrate each phase separately
Other transactions are allowed to intervene between request & response
Buffering between bus and the cache controllers allows multiple outstanding transactions (waiting for snoop and/or data responses)
Pro: By pipelining bus operations the bus is utilized more efficiently.
Con: Increased complexity.
[Figure: Pipelined timing on a split-transaction bus. The Address/CMD phase of a later transaction overlaps the memory-access delay and Data phase of earlier ones; bus arbitration precedes each phase.]
Issues supporting STBs
A new request can appear on the bus before the snoop and/or servicing of an earlier request is complete;
these requests may be conflicting requests (same block);
The number of buffers for incoming requests and potential data responses from bus to cache controller is usually fixed and small, so flow control is needed
Since requests from the bus are buffered: when and how are snoop and data responses produced on the bus?
In the same order as requests arrive?
Snoop and data response together or separately
Examples: separately (Sun), together (SGI)
There are 3 phases in a transaction:
1. A request is put on the bus
2. Snoop results are sent by other caches
3. Data is sent for the requesting cache, if needed
SGI Challenge Example
Features:
Does not allow conflicting requests for same block (8 outstanding requests)
NACK Flow-control
NACK as soon as request appears on bus, requestor retries
Separate command (incl. NACK) + address and tag + data buses
Responses may be in different order than requests
Order of transactions determined by requests
Snoop results presented on bus with response
Examine implementation specifics of:
Bus design, request response matching
Snoop results
Flow Control
How would the design of a cache controller be modified for split-transaction buses? Show examples of modern split-transaction buses/systems. How many outstanding transactions are allowed?
Bus Design
Two independently arbitrated buses:
Request: command+address (BusRd, BusWB + target address)
Response: data
Match each response to its outstanding request, since responses may arrive out of order
Tag each request with 3 bits (8 outstanding) when it is launched
The tag arrives back with the corresponding response
The address bus is free during the response, as the tag is sufficient for request matching; the address and data buses can be arbitrated separately
Separate bus lines for arbitration, flow control and snoop results
Bus and Cache Controller Design
Cache Controller
To keep track of outstanding requests on the bus:
each cache controller maintains an eight-entry buffer, the "request table"
A new request on the bus is added to all request tables at the same index
The index is the 3-bit tag assigned at arbitration
A table entry contains the block address, request type, state in that cache, etc.
The table is fully associative; a new entry can be placed anywhere in the table
It is checked for a match by the requesting processor and against all requests and responses snooped from the bus
The entry and tag are freed when the response is observed on the bus;
the tag can then be reassigned by the bus
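Below is a minimal C sketch of such a request table, indexed by the 3-bit bus tag for simplicity; it also includes the conflict check used later to avoid issuing a request for a block that already has one outstanding. All names and the exact entry layout are illustrative assumptions.

/* Eight-entry request table kept by every controller on the bus. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_TAGS 8    /* 3-bit tag => 8 outstanding requests */

typedef struct {
    bool     valid;
    uint32_t block_addr;   /* address of the requested block   */
    uint8_t  req_type;     /* e.g., BusRd, BusRdX, BusWB       */
    uint8_t  local_state;  /* state of the block in this cache */
    bool     mine;         /* set if this controller issued it */
} request_entry_t;

typedef struct { request_entry_t entry[NUM_TAGS]; } request_table_t;

/* A new request appears on the bus with its tag assigned at arbitration:
 * every controller records it at the same index. */
void table_insert(request_table_t *t, unsigned tag, uint32_t addr,
                  uint8_t type, uint8_t state, bool mine) {
    t->entry[tag] = (request_entry_t){ true, addr, type, state, mine };
}

/* Conflict check: do not issue a request for a block that already has an
 * outstanding request/response in flight. */
bool table_has_conflict(const request_table_t *t, uint32_t addr) {
    for (unsigned i = 0; i < NUM_TAGS; i++)
        if (t->entry[i].valid && t->entry[i].block_addr == addr)
            return true;
    return false;
}

/* When the response with this tag is observed on the bus, the entry and
 * the tag are freed for reassignment. */
void table_free(request_table_t *t, unsigned tag) {
    t->entry[tag].valid = false;
}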
Bus Interface and Request Table
[Figure: Bus interface logic with an eight-entry request table (entries 0-7 holding address, request + miscellaneous information, originator, and my-response fields), a request buffer, a response queue, write-back and data buffers, snoop state from the cache, comparators, and issue/merge, write-back, and response check logic, connected to the Addr+cmd bus and the Data+tag bus.]
Bus interface logic to accommodate split-transaction bus [1]
Snoop Results & Request Conflicts
Variable delay snooping
Snoop portion of the bus consists of three wired-OR lines
Sharing, dirty, inhibit
The request phase determines who will respond, but this may take many cycles, and request-response transactions may intervene
All controllers present their snoop results on bus when they see response
No data response or snoop results for write backs and upgrades
Avoid conflicts by:
Every controller keeps record of pending reads in request table
Don't issue request for a block with outstanding response
Writes performed during request phase
However, this does not ensure sequential consistency!
Flow Control
Implement flow control at:
incoming request buffers from bus to cache controller (write-back buffer)
Cache subsystem has a response buffer (address + cache block of data)
limit number of outstanding requests
Flow control is also needed at main memory; each of the 8 pending requests can generate a write-back to memory
Can happen in quick succession on bus
SGI Challenge: separate NACK lines for address and data buses
Asserted before ack phase of request (response) cycle is done
The request (response) is cancelled everywhere and retried later
Backoff and priorities to reduce traffic and starvation
SUN Enterprise: destination initiates retry when it has a free buffer
source keeps watch for this retry
guaranteed space will still be there, so only two “tries” needed at most
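As an illustration of the NACK-based scheme above, here is a minimal C sketch of a fixed-size incoming buffer that NACKs when full so that the requestor retries later; the buffer size and names are illustrative assumptions.

/* NACK-based flow control on a fixed-size incoming request buffer. */
#include <stdbool.h>
#include <stdint.h>

#define BUF_ENTRIES 8

typedef struct {
    uint32_t req[BUF_ENTRIES];
    unsigned head, count;
} incoming_buf_t;

/* Returns true if accepted, false if the request must be NACKed. */
bool accept_or_nack(incoming_buf_t *b, uint32_t req) {
    if (b->count == BUF_ENTRIES)
        return false;                      /* assert NACK; requestor retries */
    b->req[(b->head + b->count) % BUF_ENTRIES] = req;
    b->count++;
    return true;
}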
Preventing violation of Sequential Consistency
SC requires serialization of operations to different locations, not only to the same location.
With multiple outstanding requests on the bus, invalidations are buffered between the bus and the cache and are not applied to the cache immediately
Commitment versus completion
The value produced by a committed write may not yet be visible to other processors
Condition necessary for SC: a processor should not be allowed to see the new value of a write before all previous writes (in bus order) are visible to it.
1. Do not let certain types of incoming transactions from bus to cache be reordered in the incoming queues
2. Allow these reorderings in the queues, but ensure that the important orders are preserved at the necessary points in the machine
3. A simpler approach is to treat all requests in FIFO order; although this approach is simpler, it can have performance problems (a small sketch follows)
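A minimal C sketch of that FIFO approach, assuming a single incoming queue that carries both invalidations and data replies; all types and names are illustrative.

/* The incoming queue from the bus is drained strictly in FIFO order, so an
 * invalidation that appeared on the bus (committed) before a data reply is
 * applied to the cache before the reply's value can be seen. */
#include <stdint.h>
#include <stdio.h>

typedef enum { IN_INVALIDATE, IN_DATA_REPLY } incoming_kind_t;

typedef struct {
    incoming_kind_t kind;
    uint32_t        block_addr;
} incoming_msg_t;

typedef struct {
    incoming_msg_t q[16];
    unsigned head, count;
} incoming_queue_t;

/* Stub actions standing in for the real cache operations. */
static void apply_invalidate(uint32_t addr) { printf("invalidate block %#x\n", (unsigned)addr); }
static void deliver_reply(uint32_t addr)    { printf("reply for block  %#x\n", (unsigned)addr); }

/* Drain the queue in arrival (bus) order; no reordering is allowed, which is
 * simple but can delay replies behind unrelated invalidations. */
void drain_in_order(incoming_queue_t *iq) {
    while (iq->count > 0) {
        incoming_msg_t m = iq->q[iq->head];
        iq->head = (iq->head + 1) % 16;
        iq->count--;
        if (m.kind == IN_INVALIDATE) apply_invalidate(m.block_addr);
        else                         deliver_reply(m.block_addr);
    }
}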
Multi-level Caches and STB
A request takes a considerable number of cycles to propagate through the cache hierarchy
Allow other transactions to move up and down hierarchy while waiting
To maintain high bandwidth while allowing the individual units (controllers and caches) to operate at their own rates, queues are placed between levels of the hierarchy.
Leads to deadlock and serialization issues
[Figure: Queues between the levels of a two-level cache hierarchy and the bus. For each processor, numbered queues carry the processor request and response between the processor and L1, requests/responses from L1 to L2 and from L2 to L1, and requests/responses between L2 and the bus.]
Deadlock
Fetch deadlock:
Must buffer incoming requests/responses while request outstanding
One outstanding request per processor => need space to hold p requests plus one reply (latter is essential)
If the buffer is smaller (or if multiple outstanding requests are allowed), may need to NACK
Then need priority mechanism in bus arbiter to ensure progress
Buffer deadlock:
L1 to L2 queue filled with read requests, waiting for response from L2
L2 to L1 queue filled with bus requests waiting for response from L1
The latter condition arises only when a cache closer to the processor than the lowest level is write-back
Providing enough buffering is possible but requires a lot of area and is not scalable
Queues may need to support bypassing
Sequential Consistency
Separation of commitment from completion is even greater with multi-level caches
Do not wait for an invalidation to reach all the way up to L1 and return a reply; consider the write committed when it is placed on the bus
Fortunately the techniques for a single-level cache and a split-transaction bus extend; either method works:
don’t allow certain re-orderings of transactions at any level
don’t let outgoing operation proceed past level before incoming invalidations/updates at that level are applied
Shared Cache Designs
Are there any solutions for shared L2 caches that are based on a bus network? How does the bus network need to be modified to support shared caches?
Benefits of sharing a cache:
Eliminates the need for cache-coherence at this level
If L1 cache is shared then there are no multiple copies of a cache block and hence no coherence problem
Reduces the latency of communication.
L1 communication latency is 2-10 clocks; through main memory it is many times larger
reduced latency enables finer-grained sharing of data
Pre-fetching data across processors.
With private caches each processor incurs miss penalty separately
Reduces the BW requirements at the next level of the hierarchy.
More effective use of long cache blocks, as there is no false sharing;
A shared cache can be smaller than the combined size of the private caches if the working sets of different processors overlap
Shared Cache Designs
Extreme case:
All processors share an L1 cache; below it is shared main memory
Processors are connected to the shared cache by a switch, more likely a crossbar, to allow parallel cache accesses by the processors
Support high bandwidth by interleaving the cache and main memory
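A minimal C sketch of block-interleaved bank selection, of the kind that would spread crossbar accesses across cache banks, is shown below; the bank count and block size are illustrative assumptions.

/* Block-interleaved bank selection: consecutive blocks map to different
 * banks, so processors touching different blocks can be served in parallel
 * through the crossbar. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 5       /* 32-byte blocks          */
#define NUM_BANKS  4       /* 4-way interleaved cache */

static unsigned bank_of(uint32_t addr) {
    return (addr >> BLOCK_BITS) % NUM_BANKS;   /* interleave on block address */
}

int main(void) {
    /* Two adjacent blocks land in different banks and can be accessed concurrently. */
    printf("addr 0x1000 -> bank %u\n", bank_of(0x1000));
    printf("addr 0x1020 -> bank %u\n", bank_of(0x1020));
    return 0;
}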
Disadvantages of sharing L1:
higher bandwidth demand
hit latency to a shared cache is higher than to a private one
higher cache complexity
shared caches are usually slower
instead of constructive interference (as in the working-set overlap case), destructive interference can occur
Example of Shared Cache Designs
Alliant FX-8 machine (1980s),
8 custom processors
Clock cycle 170ns
Processors connected via a crossbar to a 512 KB, 4-way interleaved cache
Cache: 32 byte block size, direct mapped, write-back, 2 outstanding misses per processor
Encore Multimax (contemporary)
Snoopy cache coherent multiprocessor
Each private cache supports two processors instead of one
Practical approach:
private L1 caches and a shared L2 cache among groups of processors.
packaging considerations are also important
References
[1] David Culler, Jaswinder Pal Singh, and Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, preliminary draft edition (August 1997), pp. 355-417
[2] Daniel Braga de Faria, Stanford, Book Summaries, retrieved October 2010, from http://www-cs-students.stanford.edu/~dbfaria/
[3] Andy Pimentel, "Introduction to Parallel Architecture", retrieved October 2010, from http://staff.science.uva.nl/~andy/aci/syl.pdf, pp. 46-52
[4] R. H. Katz, S. J. Eggers, D. A. Wood, C. L. Perkins, and R. G. Sheldon, "Implementing a Cache Consistency Protocol", Proceedings of the 12th ISCA, 1985, pp. 276-283
[5] M. S. Papamarcos and J. H. Patel, "A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories", Proceedings of the 11th ISCA, 1984, pp. 348-354
[6] R. Kumar, V. Zyuban, and D. M. Tullsen, "Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling", Proceedings of ISCA, June 2005