
    Physical Design of Snoop-Based Cache Coherence in Multiprocessors

    Muge Guher


    Cache Coherence Definition

    A multiprocessor is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order.

    Write propagation

    Writes are visible to other processes

    Write serialization

    All writes to the same location are seen in the same order by all processes (extended to all locations, this is called write atomicity)

    E.g., w1 followed by w2, as seen by a read from P1, will be seen in the same order by all reads by other processors Pi
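
    To make write serialization concrete, here is a small checker (our illustration, not from the slides) that verifies, from per-processor logs of observed writes to one location, that no two processors saw a pair of writes in opposite orders:

```python
# Hypothetical write-serialization checker (illustrative only).
# observations: one list per processor of write IDs, in the order
# that processor observed them, all for a single memory location.

def serialized(observations):
    seen = {}  # (earlier, later) pairs established by some processor
    for log in observations:
        for i, w1 in enumerate(log):
            for w2 in log[i + 1:]:
                if (w2, w1) in seen:      # another log saw the opposite order
                    return False
                seen[(w1, w2)] = True
    return True

# P1 and P2 both observe w1 before w2: serialized.
assert serialized([["w1", "w2"], ["w1", "w2"]])
# P2 observes the opposite order: write serialization is violated.
assert not serialized([["w1", "w2"], ["w2", "w1"]])
```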


    Cache Coherence

    Snooping: shared-memory multiprocessor environment

    Main Memory is passive

    Caches distribute state transitions to other caches and memory

    All caches listen to snoop messages and act on them

    Most machines use cache coherence protocols with different trade-offs

    But performance (latency and bandwidth) also depends on the physical implementation:

    Bus design

    Cache design

    Integration with memory


    Cache Coherence Requirements

    Protocol Algorithm

    States

    State transitions

    Actions/Outputs

    Physical Design

    Protocol intent is implemented in FSMs

    Cache controller FSM: multiple states per miss

    Bus controller FSM

    Other Controllers

    Support for:

    Multiple Bus transactions

    Multi-Level Caches

    Split-Transaction Busses

    [Figure: MSI invalidation protocol state diagram: states M, S, I; transitions PrRd/BusRd (I to S), PrWr/BusRdX (I or S to M), BusRd/Flush (M to S), BusRdX/Flush (M to I), BusRdX/ (S to I), with PrRd/ and PrWr/ self-loops in M and a PrRd/ self-loop in S.]
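
    The diagram above can be captured as a table-driven FSM. A minimal sketch (our code, mirroring the transitions shown, not the deck's implementation) maps a (state, event) pair to the next state and the bus action:

```python
# Table-driven MSI cache-line FSM: events are processor operations
# (PrRd, PrWr) and snooped bus transactions (BusRd, BusRdX); the
# action is a bus request to issue, a Flush, or None.

MSI = {
    ("I", "PrRd"):   ("S", "BusRd"),
    ("I", "PrWr"):   ("M", "BusRdX"),
    ("S", "PrRd"):   ("S", None),
    ("S", "PrWr"):   ("M", "BusRdX"),
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "PrRd"):   ("M", None),
    ("M", "PrWr"):   ("M", None),
    ("M", "BusRd"):  ("S", "Flush"),   # supply the dirty block
    ("M", "BusRdX"): ("I", "Flush"),
}

def step(state, event):
    return MSI[(state, event)]         # (next_state, bus_action)

assert step("I", "PrWr") == ("M", "BusRdX")
assert step("M", "BusRd") == ("S", "Flush")
```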


    Design Wish List

    Implementation should be

    Correct

    Require minimal extra hardware

    Offer high performance

    High performance can be achieved with multiple events in progress, overlapping latencies

    Leads to numerous complex interactions between events

    More bugs!


    Design Issues with Implementing Snooping

    Cache controller and tags

    Bus side and processor side interactions

    Reporting snoop results: how and when

    Handling write-backs

    Non-atomic state transitions

    The overall set of actions for a memory operation is not atomic

    Race conditions

    Atomic operations

    Deadlock, livelock, starvation, serialization.


    Cache Controller and Tags

    Cache controller

    Must monitor bus operations and respond to processor operations

    Two controllers: bus-side, and processor-side

    Bus transactions: the bus side captures the address and performs a tag check.

    Fail: snoop miss, no action. Hit: invoke the cache coherence protocol, a read-modify-write on the state bits

    For single-level caches: a duplicate set of tags and state, or a dual-ported tag and state store

    Controller is an initiator and responder to bus transactions.

    [Figure: Single-level snoopy cache organization [1]. The cached data carries two sets of tags: one used by the bus snooper, one used by the processor. Data is not duplicated, and both sets of tags may be updated simultaneously.]
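
    A sketch of the bus-side lookup described above, assuming a direct-mapped cache and a duplicate tag/state store so the snoop does not contend with the processor (the sizes and names are our assumptions; protocol_step can be the MSI table from the earlier sketch):

```python
# Illustrative bus-side snoop: probe the duplicate tag/state store;
# a miss needs no action, a hit performs the protocol's state change
# as a read-modify-write on the state bits.

BLOCK = 32    # bytes per block (assumed)
NSETS = 256   # direct mapped, for simplicity (assumed)

def snoop(snoop_tags, addr, bus_event, protocol_step):
    index = (addr // BLOCK) % NSETS
    tag = addr // (BLOCK * NSETS)
    entry = snoop_tags[index]              # (tag, state) or None
    if entry is None or entry[0] != tag:
        return None                        # snoop miss: no action
    state, action = protocol_step(entry[1], bus_event)
    snoop_tags[index] = (tag, state)       # RMW on the state bits
    return action                          # e.g. "Flush" for M on BusRd
```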


    Reporting Snoop Results

    How does memory know whether another cache will respond and provide a copy of the block, so that it doesn't have to?

    Uniprocessor

    Initiator places an address on the bus

    Responder must acknowledge within a time-out window (wired-OR), otherwise bus error.

    Snooping Caches

    All caches must report on the bus before transaction can proceed.

    The snoop result tells main memory whether it should respond or whether a cache has a modified copy of the block.

    When and how is the snoop result reported on the bus? For example, to implement the MESI protocol:

    Memory needs to know: is the block dirty? Should it respond or not?

    The requesting cache needs to know: is the block shared?


    When to Report Snoop Results

    Within a fixed number of clock cycles from the address issue on the bus

    Dual set of tags, high priority processor access to the tags.

    Both tag sets are inaccessible during processor updates.

    Extra HW & longer snoop latency, but simple memory subsystem

    Pentium Pro, HP servers, Sun Enterprise.

    After a variable delay

    Memory assumes one of the caches will supply the data, until all have snooped and indicated results.

    Easier to implement in the presence of tag-access conflicts

    Higher performance: no need to assume the worst-case delay

    SGI Challenge: fetches the data, but stalls until the snoop is complete

    Immediately

    Main memory maintains a state bit per block indicating whether the block is modified in a cache.

    Complexity introduced to main memory subsystem


    How to Report Snoop Results

    Three wired-OR signals

    1, 2: Two for snoop results

    Shared: asserted if any cache has a copy

    Dirty: asserted if some cache has a dirty copy (the dirty cache knows what action to take)

    3: One indicating the snoop result is valid: an inhibit signal, asserted until all processors have completed their snoop.

    Illinois MESI protocol allows cache-to-cache transfers.

    Retrieve data from other caches rather than memory.

    Priority scheme needed

    SGI Challenge, Sun Enterprise Server: cache-to-cache transfer only in exclusive or modified state.

    Challenge updates memory during cache-to-cache transfer (no shared-modified state)
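
    The three wired-OR lines can be modeled by OR-ing per-cache contributions, with the inhibit line asserted until every cache has finished snooping (our sketch):

```python
# Model of the wired-OR snoop lines: each cache drives
# (shared, dirty, done); the bus lines are the OR across caches.

def snoop_lines(reports):
    """reports: one (shared, dirty, done) triple of booleans per cache."""
    shared = any(s for s, _, _ in reports)
    dirty = any(d for _, d, _ in reports)
    inhibit = not all(done for _, _, done in reports)
    return shared, dirty, inhibit

# One cache holds the block dirty; another has not finished its snoop,
# so inhibit is still asserted and memory must keep waiting.
assert snoop_lines([(False, True, True),
                    (False, False, False)]) == (False, True, True)
```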


    Single Level Snooping Cache

    [Figure: Snooping cache design [1]. Processor-side and bus-side controllers share the cache data RAM; tags and state are kept in two copies, one for the processor and one for the snoop, each with its own comparator; a write-back buffer and a data buffer connect to the system bus (address, command, data).]

    Assumptions:

    - Single-level write-back cache
    - Invalidation protocol
    - Processor can have one memory request outstanding
    - The system bus is atomic


    Multi-level Cache Hierarchies

    How would the design of a cache controller be modified in the case of L1/L2 caches?

    Complicates Coherence

    Changes made by the processor to the L1 cache may not be visible to the L2 cache controller, which is responsible for bus operations

    Bus transactions are not directly visible to L1 cache

    A Solution:

    Independent bus snooping HW for each cache level hierarchy

    L1 cache is usually on the processor chip; an on-chip snooper consumes pins to monitor the shared bus

    Duplicating tags consumes too much on-chip area

    Duplication of effort between L1 and L2 snoops.

    Intel's 8870 chipset has a snoop filter for quad-core


    How do you guarantee coherence in a multi-level cache hierarchy?

    Better Solution: Based on the Inclusion Property

    1. If memory block is in L1 cache it must also be in L2 cache

    2. If the block is in modified state (or shared-modified) in the L1 cache, then it must also be marked modified in the L2 cache (its copy in L2)

    Therefore:

    only a snooper at L2 is necessary, as it has all the required information

    If a BusRd requests a block that is in modified state in either cache, then L2 can wave off the memory access and inform L1.

    Now information flows both ways:

    L1 accesses L2 for cache miss handling and block state changes;

    L2 forwards to L1 blocks invalidated/updated by bus transactions;


    Inclusion Property

    Difficulties with maintaining the inclusion property:

    L1 and L2 may have different eviction algorithms (replacement differences)

    While a block is kept by L1 it may be evicted by L2

    Separated data and instruction caches.

    Different cache block sizes.

    In the most commonly encountered case, inclusion works automatically:

    L1 is direct mapped

    L2 is either direct mapped or set associative

    Same block size for both caches

    Number of sets in L1 is smaller than in L2
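
    These four conditions can be written as a simple predicate (a sketch; the parameter names are ours):

```python
# Predicate for the common case in which inclusion holds automatically.
# L2 may be direct mapped or set associative, so its associativity is
# not a parameter here.

def inclusion_is_automatic(l1_assoc, l1_sets, l2_sets, l1_block, l2_block):
    return (l1_assoc == 1              # L1 is direct mapped
            and l1_block == l2_block   # same block size in both caches
            and l1_sets <= l2_sets)    # L1 has no more sets than L2

assert inclusion_is_automatic(1, 128, 1024, 32, 32)
assert not inclusion_is_automatic(2, 128, 1024, 32, 32)  # associative L1
```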


    Explicitly Maintaining Inclusion

    Extend the mechanisms used for propagating coherence events to the cache hierarchy.

    Propagate L2 replacements to L1

    Invalidate or flush messages

    Propagate bus transactions from L2 to L1

    Send all transactions to L1 (even if the given block is not present there)

    Add extra state to L2 (a bit per block) recording which blocks in L2 are also in L1 (the inclusion bit)

    On a write, propagate the modified state from L1 to L2. If L1 is:

    Write-through (so all modifications also reach L2): nothing more is needed

    Write-back: add a "modified-but-stale" state bit to every block in L2

    Request a flush from L1 on a bus read

    L2 serves as a filter for the L1 cache, screening out irrelevant transactions from the bus; i.e., dual tags are less critical with multilevel caches
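
    A sketch of the bookkeeping this implies, assuming a write-back L1 and per-block inclusion and modified-but-stale bits in L2 (the structure and names are ours):

```python
# Illustrative L2-side bookkeeping for explicitly maintained inclusion.

class L1Stub:
    def invalidate(self, addr):
        print(f"L1: invalidate {addr:#x}")   # keep inclusion on L2 eviction
    def flush(self, addr):
        print(f"L1: flush {addr:#x}")        # send dirty data down to L2

class L2Entry:
    def __init__(self):
        self.in_l1 = False                   # inclusion bit
        self.modified_but_stale = False      # L1 holds the only fresh copy

def on_l2_replacement(entry, addr, l1):
    if entry.in_l1:
        l1.invalidate(addr)                  # L2 replacement propagates to L1
        entry.in_l1 = False

def on_l1_write(entry):
    entry.modified_but_stale = True          # write-back L1: L2 copy is stale

def on_bus_read(entry, addr, l1):
    if entry.modified_but_stale:
        l1.flush(addr)                       # request flush from L1 on BusRd
        entry.modified_but_stale = False
```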


    Propagating Transactions for Coherence in the Hierarchy

    How is a transaction propagated for multilevel caches? Show some examples of modern systems.

    Only one transaction on the bus at a time.

    Transactions are propagated up and down the hierarchy; bus transactions may be held until propagation completes.

    The performance penalty for holding a processor write until the BusRdX has been granted is high, so there is motivation to decouple these operations

    [Figure: Two-level snoopy cache organization [1]. The L1 cache tags and data are used mainly by the processor; the L2 cache tags and data are used mainly by the bus snooper.]


    Split-Transaction Bus

    In a split-transaction bus (STB), transactions that require a response are split into two independent sub-transactions: a request transaction and a response transaction.

    Arbitrate each phase separately

    Other transactions are allowed to intervene between request & response

    Buffering between the bus and the cache controllers allows multiple outstanding transactions (waiting for snoop and/or data responses)

    Pro: By pipelining bus operations the bus is utilized more efficiently.

    Con: Increased complexity.

    [Figure: Pipelined operation of a split-transaction bus. Successive transactions overlap their bus-arbitration, address/command, memory-access-delay, and data phases.]


    Issues Supporting STBs

    A new request can appear on the bus before the snoop and/or servicing of an earlier request is complete;

    these requests may be conflicting requests (same block);

    The number of buffers for incoming requests and potential data responses from bus to cache controller is usually fixed and small, so flow control is needed

    Since requests from the bus are buffered, when and how are snoop and data responses produced on the bus?

    In the same order as requests arrive?

    Snoop and data response together or separately

    Examples: separately (Sun); together (SGI)

    There are 3 phases in a transaction:

    1. A request is put on the bus

    2. Snoop results are sent by other caches

    3. Data is sent for the requesting cache, if needed


    SGI Challenge Example

    Features:

    Does not allow conflicting requests for same block (8 outstanding requests)

    NACK Flow-control

    NACK as soon as request appears on bus, requestor retries

    Separate command (incl. NACK) + address and tag + data buses

    Responses may be in different order than requests

    Order of transactions determined by requests

    Snoop results presented on bus with response

    Examine implementation specifics of:

    Bus design, request response matching

    Snoop results

    Flow Control


    Bus Design

    Two independently arbitrated buses:

    Request: command+address (BusRd, BusWB + target address)

    Response: data

    Each response must be matched to its outstanding request, since responses arrive out of order

    Each request is tagged with 3 bits (8 outstanding) when launched

    Tag arrives back with corresponding response

    The address bus is left free, as the tag is sufficient for request matching

    Address and data buses can be arbitrated separately

    Separate bus lines for arbitration, flow control and snoop results


    Bus and Cache Controller Design

    Cache Controller

    To keep track of outstanding requests on the bus, each cache controller maintains an eight-entry buffer, the request table

    A new request on the bus is added to all request tables at the same index

    Index is 3-bit tag assigned at arbitration

    A table entry contains the block address, request type, state in that cache, etc.

    Table is fully associative, new entry can be placed anywhere in table

    The table is checked for a match by the requesting processor and by all snooped requests and responses on the bus

    The entry and tag are freed when the response is observed on the bus

    The tag can then be reassigned by the bus
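
    A sketch of this request table and the 3-bit tag recycling (the fields follow the slide; the code structure is ours):

```python
# Illustrative eight-entry request table: a new bus request is entered
# at the index given by its 3-bit tag; the entry and tag are freed when
# the matching response is observed on the bus.

NTAGS = 8

class RequestTable:
    def __init__(self):
        self.entries = [None] * NTAGS  # per tag: (block_addr, req_type, state)
        self.free = set(range(NTAGS))

    def add(self, tag, block_addr, req_type, state):
        assert tag in self.free        # tag was assigned at bus arbitration
        self.free.discard(tag)
        self.entries[tag] = (block_addr, req_type, state)

    def conflicts(self, block_addr):   # fully associative address match
        return any(e is not None and e[0] == block_addr for e in self.entries)

    def complete(self, tag):           # response seen: free entry and tag
        self.entries[tag] = None
        self.free.add(tag)

rt = RequestTable()
rt.add(3, 0x1000, "BusRd", "I")
assert rt.conflicts(0x1000)            # don't issue a conflicting request
rt.complete(3)
assert not rt.conflicts(0x1000)
```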


    Bus Interface and Request Table

    [Figure: Bus interface logic to accommodate a split-transaction bus [1]. An eight-entry request table (tag, address, originator, my-response information) sits beside a request buffer, a write-back buffer, a data buffer, and request and response queues; comparators match snooped addresses and response tags, snoop state comes from the cache, and the assembly connects to the addr+cmd bus and the data+tag bus.]


    Snoop Results & Request Conflicts

    Variable delay snooping

    Snoop portion of the bus consists of three wired-OR lines

    Sharing, dirty, inhibit

    The request phase determines who will respond, but it may take many cycles, with intervening request-response transactions

    All controllers present their snoop results on the bus when they see the response

    No data response or snoop results for write backs and upgrades

    Avoid conflicts by:

    Every controller keeps a record of pending reads in its request table

    Don't issue request for a block with outstanding response

    Writes performed during request phase

    However, this does not ensure sequential consistency!


    Flow Control

    Implement flow control at:

    incoming request buffers from bus to cache controller (write-back buffer)

    Cache subsystem has a response buffer (address + cache block of data)

    limit number of outstanding requests

    Flow control is also needed at main memory,

    Each of the 8 pending requests can generate a write-back to memory

    These write-backs can happen in quick succession on the bus

    SGI Challenge: separate NACK lines for address and data buses

    Asserted before ack phase of request (response) cycle is done

    The request (response) is cancelled everywhere and retried later

    Backoff and priorities reduce traffic and starvation

    Sun Enterprise: the destination initiates a retry when it has a free buffer

    source keeps watch for this retry

    Space is guaranteed to still be there, so at most two tries are needed
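
    A sketch of NACK-based flow control at a fixed-size incoming buffer, with the requester retrying later (behavior modeled loosely on the SGI Challenge description above; the structure is ours):

```python
# Illustrative NACK flow control: a controller with a small, fixed
# incoming buffer NACKs a request as soon as it appears if no space
# is free; the requester backs off and retries.

from collections import deque

class IncomingBuffer:
    def __init__(self, capacity):
        self.q = deque()
        self.capacity = capacity

    def offer(self, request):
        if len(self.q) >= self.capacity:
            return "NACK"              # cancelled everywhere; retry later
        self.q.append(request)
        return "ACK"

buf = IncomingBuffer(capacity=2)
assert buf.offer("r0") == "ACK"
assert buf.offer("r1") == "ACK"
assert buf.offer("r2") == "NACK"       # buffer full: requester must retry
```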


    Preventing violation of Sequential Consistency

    SC: Serialization of operations to different locations.

    Multiple outstanding requests on the bus; invalidations are buffered between bus and cache and are not applied to the cache immediately

    Commitment versus completion

    The value produced by a committed write may not yet be visible to other processors

    Condition necessary for SC: a processor should not be allowed to actually see the new value of a write before previous writes (in bus order) are visible to it. Possible approaches:

    1. Not letting certain types of incoming transactions from bus to cache be reordered in the incoming queues

    2. Allowing these re-orderings in the queues, but then ensuring that the important orders are preserved at the necessary points in the machine.

    3. A simpler approach is to treat all the requests in FIFO order. Although this approach is simpler, it can have performance problems.
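
    Option 3 amounts to applying incoming bus transactions to the cache strictly in bus order (our sketch):

```python
# Illustrative FIFO handling of the incoming bus-to-cache queue: a data
# reply is never applied before earlier (bus-ordered) invalidations, so
# a processor cannot see a new value ahead of prior writes.

from collections import deque

class Cache:
    def apply(self, txn):
        print("applying", txn)

incoming = deque()                     # bus order is preserved

def from_bus(txn):
    incoming.append(txn)

def drain(cache):
    while incoming:                    # strictly FIFO, no reordering
        cache.apply(incoming.popleft())

from_bus(("invalidate", 0x100))
from_bus(("data_reply", 0x200))        # must not bypass the invalidation
drain(Cache())
```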


    Multi-level Caches and STB

    A considerable number of cycles is needed for a request to propagate through the cache hierarchy

    Allow other transactions to move up and down hierarchy while waiting

    To maintain high bandwidth while allowing the individual units (controllers and caches) to operate at their own rates, queues are placed between levels of the hierarchy.

    Leads to deadlock and serialization issues

    [Figure: Queues in a two-level cache hierarchy. Processor requests flow down through L1 and L2 to the bus; responses and incoming requests from the bus flow up from L2 to L1 to the processor, with queues between each pair of levels.]


    Deadlock

    Fetch deadlock:

    Must buffer incoming requests/responses while request outstanding

    One outstanding request per processor => need space to hold p requests plus one reply (the latter is essential)

    If the buffer is smaller (or if there are multiple outstanding requests), may need to NACK

    Then need priority mechanism in bus arbiter to ensure progress

    Buffer deadlock:

    L1 to L2 queue filled with read requests, waiting for response from L2

    L2 to L1 queue filled with bus requests waiting for response from L1

    The latter condition arises only when a cache closer to the processor than the lowest level is write-back

    Could provide enough buffering, requires a lot of area, not scalable

    Queues may need to support bypassing
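
    The fetch-deadlock sizing rule (space for p requests plus one reply, given one outstanding request per processor) can be stated as a sketch (ours):

```python
# Illustrative sizing for fetch-deadlock avoidance: while its own
# request is outstanding, a controller may have to sink one request
# from every processor plus the one essential reply.

def min_buffer_slots(num_processors):
    return num_processors + 1          # p requests + one reply

assert min_buffer_slots(4) == 5
# With fewer slots (or multiple outstanding requests per processor),
# the controller must NACK, and the bus arbiter needs a priority
# mechanism to guarantee forward progress.
```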


    Sequential Consistency

    Separation of commitment from completion is even greater with multi-level caches

    Do not wait for an invalidation to reach all the way up to L1 and return a reply; consider the write committed when it is placed on the bus

    Fortunately, the techniques for single-level caches and split-transaction buses extend; either method works:

    Don't allow certain re-orderings of transactions at any level

    Don't let an outgoing operation proceed past a level before incoming invalidations/updates at that level are applied


    Shared Cache Designs

    Are there any shared-L2-cache solutions based on a bus network? How does the bus network need to be modified to support shared caches?

    Benefits of sharing a cache:

    Eliminates the need for cache-coherence at this level

    If the L1 cache is shared then there are no multiple copies of a cache block, and hence no coherence problem

    Reduces the latency of communication: L1 communication latency is 2-10 clocks; through main memory it is many times larger

    reduced latency enables finer-grained sharing of data

    Pre-fetching data across processors

    With private caches, each processor incurs the miss penalty separately

    Reduces the BW requirements at the next level of the hierarchy

    More effective use of long cache blocks, as there is no false sharing;

    A shared cache is smaller than the combined size of the private caches if working sets from different processors overlap


    Shared Cache Designs

    Extreme case:

    All processors share an L1 cache; below it is a shared main memory

    Processors are connected to the shared cache by a switch, more likely a crossbar, to allow parallel cache access by the processors

    Support high BW by interleaving cache and main memory

    Disadvantages of sharing L1:

    higher bandwidth demand

    hit latency to a shared cache is higher than to a private one

    higher cache complexity

    shared caches are usually slower

    Instead of constructive interference (like the working-set example), destructive interference can occur


    Example of Shared Cache Designs

    Alliant FX-8 machine (1980s)

    8 custom processors

    Clock cycle 170ns

    Processors connected by a crossbar to a 512 KB, 4-way interleaved cache

    Cache: 32-byte block size, direct mapped, write-back, 2 outstanding misses per processor

    Encore Multimax (contemporary)

    Snoopy cache coherent multiprocessor

    Each private cache supports two processors instead of one

    Practical approach:

    private L1 caches and a shared L2 cache among groups of processors.

    packaging considerations are also important


    References

    [1] David Culler, Jaswinder Pal Singh, and Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, preliminary draft edition (August 1997), pp. 355-417.

    [2] Daniel Braga de Faria, Stanford, Book Summaries, retrieved October 2010, from http://www-cs-students.stanford.edu/~dbfaria/.

    [3] Andy Pimentel, Introduction to Parallel Architecture, retrieved October 2010, from http://staff.science.uva.nl/~andy/aci/syl.pdf, pp. 46-52.

    [4] R. H. Katz, S. J. Eggers, D. A. Wood, C. L. Perkins, and R. G. Sheldon, "Implementing a Cache Consistency Protocol," Proceedings of the 12th ISCA, 1985, pp. 276-283.

    [5] M. S. Papamarcos and J. H. Patel, "A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories," Proceedings of the 11th ISCA, 1984, pp. 348-354.

    [6] R. Kumar, V. Zyuban, and D. M. Tullsen, "Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling," Proceedings of ISCA, June 2005.