Mc Main Cache en-nup


    Multi-Core Programming - Caches and Memory Hierarchies

    Prof. Dr. Gudula Rünger

    Professur Praktische Informatik, Technische Universität Chemnitz

    Winter Term 2013/2014

    Prof. Dr. Gudula Rünger Multicore Programming Winter Term 2013/2014 1 / 54

    Outline

    1 Caches
      - Cache Associativity
      - Block Replacement Strategies
      - Write Back Strategies

    2 Cache Coherency
      - Bus-Snooping
      - Write-Back Invalidate Protocol (MSI-Protocol)
      - MESI Protocol
      - Cache-Coherence in Systems not using a Bus
      - Multiprocessor-Cache-Coherency

    3 Memory Consistency
      - Sequential Consistency
      - Weaker Consistency Models


    Key characteristics of Hardware Development

    - Growth of DRAM capacity (60% per year) (1980-98)

    - Growth of processor performance (55-80% per year)

    - Reduction of DRAM access time approx. 25% per year

    -> Introduction of caches and memory hierarchies between processor and main memory


    Cache Memory

    SRAM (Static Random Access Memory) chip (0.5 ns to 5 ns) instead of 45 ns to 70 ns when using DRAM (ns = nanosecond = $10^{-9}$ sec)

    Reload strategy: data blocks (cache block, cache line) of constant size are loaded from the main memory into the cache

    [Figure: main memory (HS) transfers blocks to the cache (C); the cache transfers words to the processor (P)]

    Data is accessed via the cache:

    Cache hit or cache miss


    Locality of Memory Accesses

    Spatial Locality

    Neighboring (spatial) accesses to memory locations of main memory at consecutive points in time during program execution

    Some of the successive memory accesses will access the same cache line and no further reloading from memory is necessary (the use of cache lines comprising several memory words is based on the assumption that most programs exhibit spatial locality).

    Temporal Locality

    The same memory location is accessed multiple times at consecutive points in time during program execution

    After loading, the memory words of the cache block are accessed multiple times before the cache block is replaced again.


    Cache Characteristics

    Cache line size

    The size of the block of consecutive memory locations which are loaded as a whole. (Idea: Spatial locality of the program is assumed, i.e. access to a memory location at time $t$, afterwards access to a neighboring memory location at time $t + 1$.)
    Typical size: 4 to 8 words (32 to 64 bytes)

    Cache size

    Number of bytes a cache is able to store.
    Typical size: 8 to 128 memory words

    Cache associativity

    Mapping of memory blocks to cache lines. The associativity determines at how many positions in the cache a memory block can be stored.

    a) Direct-mapped cache: Each memory block can be stored at exactly one position in the cache

    b) Fully associative cache: Each memory block can be stored at an arbitrary location in the cache

    c) Set associative cache: Each memory block can be stored at a fixed number of positions in the cache


    Cache Associativity

    Main memory: $n = 2^s$ blocks $B_j$, $j = 0, \ldots, n-1$

    Cache memory: $m = 2^r$ cache lines $\bar{B}_i$, $i = 0, \ldots, m-1$

    Memory block, cache block: $l = 2^w$ memory words

    The tag for identifying a cached memory block depends on the cache method


    Direct Mapped Cache

    Exactly one cache line for each memory block.

    E.g. $B_j \mapsto \bar{B}_i$ with $i = j \bmod m$, i.e.

    cache block    memory blocks
    0              $0, m, 2m, \ldots, 2^s - m$
    1              $1, m+1, \ldots, 2^s - m + 1$
    ...            ...
    $m-1$          $m-1, \ldots, 2^s - 1$

    $n/m = 2^{s-r}$ memory blocks for each cache block

    Access to a memory location using the address of the value:

    memory address = block address + word address

    - block address (address of the memory block): $s$ bits (identifies $2^s$ blocks)
    - word address (relative address in the memory block): $w$ bits (identifies $2^w$ words)

    Layout of the memory address (leftmost bits first):

    tag ($s-r$ bits) | address of the cache block ($r$ bits) | word address ($w$ bits)


    [Figure: memory block addresses 000 to 111 mapped to cache blocks; memory blocks that share the same cache block are distinguished by different tags]

    Memory access (steps):

    - Identifying the cache block (the $r$ rightmost bits of the $s$-bit block address)

    - Comparison of the stored tag ($s-r$ bits) with the $s-r$ leftmost bits of the memory address:

      match -> cache hit
      no match -> cache miss, reloading

    Disadvantage:

    - Only one position per memory block in the cache -> It is possible that memory blocks competing for one cache position are continually loaded and replaced in the cache (thrashing)
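    The address decomposition and the thrashing effect can be sketched in a few lines of Python. This is only an illustration under assumed parameters: the function names `split_address` and `count_misses` and the tiny configuration ($s = 3$, $r = 2$, $w = 0$) are not part of the lecture.

```python
def split_address(addr, s, r, w):
    """Split a memory address into (tag, cache block index, word address).

    Layout, leftmost bits first: tag (s - r bits), index (r bits), word (w bits).
    """
    if not 0 <= addr < (1 << (s + w)):
        raise ValueError("address does not fit into s + w bits")
    word = addr & ((1 << w) - 1)      # w rightmost bits
    block = addr >> w                 # s-bit block address j
    index = block & ((1 << r) - 1)    # i = j mod m with m = 2^r
    tag = block >> r                  # s - r leftmost bits
    return tag, index, word


def count_misses(block_trace, r):
    """Count the misses of a direct-mapped cache with 2^r cache blocks."""
    stored_tag = {}                   # cache block index -> tag currently stored
    misses = 0
    for j in block_trace:
        index, tag = j % (1 << r), j >> r
        if stored_tag.get(index) != tag:   # no match: cache miss, reload
            stored_tag[index] = tag
            misses += 1
    return misses


# Assumed toy configuration: 8 memory blocks, 4 cache blocks, 1 word per block.
print(split_address(0b101, s=3, r=2, w=0))
# Blocks 0 and 4 map to the same cache block and evict each other (thrashing).
print(count_misses([0, 4, 0, 4, 0, 4], r=2))
```

    In the second call every access is a miss, although the other three cache blocks stay unused.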


    Fully-Associative Cache

    Each memory block can be placed in any cache position.
    (Overcoming the disadvantage of direct mapped caches.)

    memory address = block address ($s$ bits) + word address ($w$ bits)
    ($2^s$ blocks of $2^w$ words)

    The whole block address ($s$ bits) is used as tag, followed by the word address ($w$ bits).

    Memory Access: All tag entries in the cache must be searched.

    Advantage: increased flexibility when loading memory blocks into the cache
    Disadvantage: expensive search (complex circuit or time delays)

    Example:

    - Cache of size 64 KByte, cache block size is 4 Byte ($w = 2$) -> 16 K = $2^{14}$ blocks within the cache ($r = 14$)
    - Size of main memory 16 MBytes = $2^{24}$ bytes -> $2^{22}$ blocks ($s = 22$)
    - Tags are 22 bits -> for each 32-bit block (4 bytes) a 22-bit tag has to be stored.

    In comparison, a direct-mapped cache: $r = 14$ -> $s - r = 22 - 14 = 8$,
    i.e. tags of size 8 bits, and the cache does not need to be searched.
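    The tag sizes of this example can be recomputed with a short Python sketch. The helper `tag_bits` and its parameter names are assumptions made for illustration; all sizes are assumed to be powers of two.

```python
def tag_bits(cache_bytes, block_bytes, memory_bytes, ways):
    """Tag size in bits for a cache with the given associativity.

    ways = 1 is a direct-mapped cache, ways = number of cache blocks is
    a fully associative cache. All sizes must be powers of two.
    """
    cache_blocks = cache_bytes // block_bytes       # m = 2^r
    memory_blocks = memory_bytes // block_bytes     # n = 2^s
    sets = cache_blocks // ways                     # v = m / k = 2^d
    s = memory_blocks.bit_length() - 1              # log2(n)
    d = sets.bit_length() - 1                       # log2(v)
    return s - d                                    # tag = s - d leftmost bits


KB, MB = 2**10, 2**20
cache_blocks = 64 * KB // 4
print(tag_bits(64 * KB, 4, 16 * MB, ways=cache_blocks))  # fully associative
print(tag_bits(64 * KB, 4, 16 * MB, ways=1))             # direct mapped
```

    With the numbers from the example this yields 22 tag bits for the fully associative cache and 8 tag bits for the direct-mapped cache.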


    Set-Associative Cache

    The cache is partitioned into $v$ sets $S_0, \ldots, S_{v-1}$ ($v = 2^d$); each set consists of $k = m/v$ cache blocks.

    Mapping of memory blocks to sets:

    $B_j \mapsto S_i$ with $i = j \bmod v$, i.e.

    set        memory blocks
    0          $0, v, 2v, \ldots, 2^s - v$
    1          $1, v+1, \ldots, 2^s - v + 1$
    ...        ...
    $v-1$      $v-1, \ldots, 2^s - 1$

    A memory block can be placed in any cache block of its set.

    memory address = block address ($s$ bits) + word address ($w$ bits)

    Layout of the memory address (leftmost bits first):

    tag ($s-d$ bits) | set $S_i$ ($d$ bits) | word address ($w$ bits)


    [Figure: all memory block addresses 000 to 111 split into tag and set bits; example with two sets $S_0$ and $S_1$ of 4 cache blocks each]

    Memory Access (steps):

    - Identifying the set $S_i$ for block $B_j$

    - Comparison of the memory address ($s-d$ leftmost bits) with the tags of the cache blocks within the set $S_i$

    Special cases:

    1) $v = m$ and $k = 1$: set-associative cache = direct mapped cache

    2) $v = 1$ and $k = m$: set-associative cache = fully associative cache

    3) $v = m/2$ and $k = 2$: 2-way set-associative

    4) $v = m/4$ and $k = 4$: 4-way set-associative


    Block Replacement Strategies

    a) Direct-mapped cache: Only one cache position can be used; its content is replaced with the new block.

    b), c) Fully-associative cache or set-associative cache:

    - LRU replacement strategy (Least Recently Used):
      For a two-way set-associative cache an additional USE bit is used, i.e. USE bit = 1 is set if the first cache block of the set was referenced, and USE bit = 0 is set if the second cache block of the set was referenced. On replacement, the block that was not referenced last is replaced.

    - LFU replacement strategy (Least Frequently Used):
      The number of accesses is used.

    - Random
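    As an illustration, LRU replacement in a set-associative cache can be simulated with the following Python sketch. The class `SetAssociativeCache`, the helper `miss_count` and the access trace are assumptions for this example; the USE-bit hardware realization is replaced here by an ordered dictionary per set.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Cache with num_sets sets of `ways` blocks each and LRU replacement."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered dict per set: least recently used tag comes first.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.misses = 0

    def access(self, block):
        """Access memory block j; return True on a hit, False on a miss."""
        cache_set = self.sets[block % self.num_sets]   # set index i = j mod v
        tag = block // self.num_sets
        if tag in cache_set:
            cache_set.move_to_end(tag)                 # now most recently used
            return True
        if len(cache_set) == self.ways:
            cache_set.popitem(last=False)              # evict the LRU block
        cache_set[tag] = True
        self.misses += 1
        return False


def miss_count(trace, num_sets, ways):
    cache = SetAssociativeCache(num_sets, ways)
    for block in trace:
        cache.access(block)
    return cache.misses


trace = [0, 4, 0, 4, 0, 4]                      # assumed block access trace
print(miss_count(trace, num_sets=4, ways=1))    # direct mapped (v = m)
print(miss_count(trace, num_sets=2, ways=2))    # 2-way set-associative
print(miss_count(trace, num_sets=1, ways=4))    # fully associative (v = 1)
```

    All three configurations hold four cache blocks. Only the direct-mapped one misses on every access of this trace; the other two miss only on the first access to each block.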


    Write Back Strategies

    Write operation on cache hit:

    - Write operations change values in the cache;

    - Reading correct values on read access must be ensured.

    - The modified value has to be updated in main memory, too;

    Write back strategies determine the time and the strategy of updating the main memory:

    - Write-through Strategy
    - Write-back Strategy

    Write-through Strategy

    After a write operation in the cache has been performed, the corresponding write operation has to be performed in main memory directly (write-through).

    - Cache blocks and the corresponding memory blocks always have the same value
    - I/O peripherals or additional processors always have a recent view of the memory.

    Problem: The write operation to main memory takes a long time -> write stall for the processor
    Remedy: Usage of write buffers


    Write-back Strategy

    Write operations are at first performed in the cache only. When the modified cache block is replaced, the corresponding memory block is updated (write back).

    Cache block and corresponding memory block possibly contain different values.

    Realization of the write-back strategy:

    - A dirty bit indicates whether a cache block was modified.

    - Only modified cache blocks must be written back.

    Advantage: Fewer main memory operations

    Disadvantage: Main memory may contain invalid values -> I/O peripherals cannot access the main memory directly but have to access the cache

    Write operations on a cache miss:

    a) Write allocate (fetch on write): The memory block is fetched from main memory and the write is treated like a cache hit afterwards.

    b) No-write allocate (write around): On a cache miss the memory block is modified only in main memory.
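    The difference in main memory traffic between the two strategies can be sketched with a deliberately tiny model. The function `memory_writes`, the single-line cache and the trace below are assumptions for illustration; both variants use write allocate, and a final flush of a dirty block is counted for write-back.

```python
def memory_writes(trace, policy):
    """Count main memory writes for a cache that holds a single block.

    trace:  list of ("r" | "w", block) operations
    policy: "write-through" or "write-back"
    """
    cached_block, dirty, writes = None, False, 0
    for op, block in trace:
        if block != cached_block:              # miss: replace the cached block
            if policy == "write-back" and dirty:
                writes += 1                    # write back the modified block
            cached_block, dirty = block, False
        if op == "w":
            if policy == "write-through":
                writes += 1                    # every write goes to memory
            else:
                dirty = True                   # only mark the block (dirty bit)
    if policy == "write-back" and dirty:
        writes += 1                            # final write back
    return writes


# Assumed trace: five writes to block 0, then one read and three writes to block 1.
trace = [("w", 0)] * 5 + [("r", 1)] + [("w", 1)] * 3
print(memory_writes(trace, "write-through"))
print(memory_writes(trace, "write-back"))
```

    For this trace, write-through performs one memory write per write operation (8), while write-back writes each modified block back once (2).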


    Cache Coherency

    Situation: Multiprocessor with a local cache for each processor

    Memory Coherency Problem

    The same memory block is potentially located in different local caches at the same time. After a modification, local caches and the main memory may contain different, inconsistent values for the same memory location.


    Example for Cache Coherency

    - Bus based SMP system with processors $P_i$ and local caches $C_i$, $i = 1, 2, 3$

    - Write-through strategy

    - Variable $u$ located in memory $M$ has value 5

    time    action                    effect
    t1      P1 reads variable u       block with u is loaded into C1
    t2      P3 reads variable u       block with u is loaded into C3
    t3      P3 writes 7 into u        the value in memory M is changed
    t4      P1 reads variable u       P1 gets the value 5, because u is cached
                                      (with write-back strategy, P1 reads 5 as well)
    t5      P2 reads variable u       P2 reads the value 7 from M
                                      (with write-back strategy, P2 reads the value 5)
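    The five steps can be replayed with a minimal Python model of three caches without any coherence protocol. The function `run` and its dictionary-based caches are assumptions for illustration only.

```python
def run(write_through):
    """Replay t1..t5 for variable u; return the values read at t4 and t5."""
    memory = {"u": 5}
    caches = {1: {}, 2: {}, 3: {}}      # local caches C1, C2, C3

    def read(p):
        if "u" not in caches[p]:        # miss: load the block from memory
            caches[p]["u"] = memory["u"]
        return caches[p]["u"]

    def write(p, value):
        caches[p]["u"] = value
        if write_through:               # write-back would delay this update
            memory["u"] = value

    read(1)                             # t1: P1 reads u
    read(3)                             # t2: P3 reads u
    write(3, 7)                         # t3: P3 writes 7 into u
    return read(1), read(2)             # t4: P1 reads u, t5: P2 reads u


print(run(write_through=True))
print(run(write_through=False))
```

    With write-through, P1 still sees the stale value 5 while P2 reads 7; with write-back both read 5, as stated in the table.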


    Coherency of Memory

    Describes the behavior on read and write accesses to the same memory location from several processors of a multiprocessor system.
    (Coherency: context; logical, argumentative coherence)

    Informal: A memory system is coherent, if for each memory location a read access returns the last value written.

    Which value is the last value written in a multiprocessor system?

    (Processors may write to a memory location at the same time)

    Clarifying the Notion of Coherence

    - As a measure of time, not the time of a physical read or write is used, but the position of the operation in the program order.

    - Without local caches the memory system is coherent, because the main memory leads to a sequentialization of the memory accesses to a memory location and the memory accesses are done in program order (write sequentialization).


    Notion of Coherence

    A memory system is coherent, if the following conditions are true:

    (1) If processor $P$ writes to memory location $x$ at time $t_1$ and reads $x$ at time $t_2 > t_1$ and if no other processor writes to $x$ in the interval between $t_1$ and $t_2$, then $P$ reads at time $t_2$ the value written at time $t_1$.
        (i.e. for each processor of the parallel system the program order is maintained)
        (Counterexample: write-back strategy, and at time $t_0 < t_1$ a processor $P'$ writes to $x$, then its cache block gets replaced at a time $t_1' > t_1$)

    (2) If processor $P_1$ writes to memory location $x$ at time $t_1$ and processor $P_2$ reads memory location $x$ at time $t_2$ and if no other processor writes to $x$ in the interval between $t_1$ and $t_2$ and the interval $t_2 - t_1$ is long enough, then $P_2$ reads the value written by $P_1$.
        (after a certain amount of time the new value is visible)

    (3) If any two processors write to the same memory location $x$, then the write accesses are sequentialized, i.e. all processors see the write accesses in the same order. (global sequentialization)


    Bus-Snooping

    - Possible realization of cache coherency in a bus based SMP system with write-through strategy

    - All memory accesses are handled by a centralized bus

    - Cache controllers of all processors observe the bus, i.e. all write operations to global memory are recognized

    - Values of memory locations located in the local cache are updated

    -> Local caches contain up to date values

    (In the example above: P1 observes the write operation from P3)

    Disadvantage: Heavy write traffic on the bus because of write-through

    Example:

    - Bus system with 200 MHz processors, each of which requires one cycle per instruction (i.e. 200 million instructions per second)
    - 15% write operations -> 30 million write operations per second
    - 8 bytes per write operation (cache line size) -> 240 MB/sec
    - -> A bus with 1 GB/sec bandwidth can handle 4 processors

    -> A write-back strategy with a suitable protocol is needed
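    The arithmetic of this example can be checked with a few lines of Python. The variable names are chosen for illustration, and 1 GB/sec is assumed to mean $10^9$ bytes per second.

```python
instructions_per_second = 200_000_000      # 200 MHz, one cycle per instruction
write_fraction_percent = 15                # 15% of the instructions are writes
bytes_per_write = 8                        # transferred per write operation
bus_bandwidth = 1_000_000_000              # assumed: 1 GB/sec = 10^9 bytes/sec

writes_per_second = instructions_per_second * write_fraction_percent // 100
traffic_per_processor = writes_per_second * bytes_per_write
max_processors = bus_bandwidth // traffic_per_processor

print(writes_per_second)        # write operations per second and processor
print(traffic_per_processor)    # bytes per second and processor
print(max_processors)           # processors the bus can serve
```

    The write traffic alone saturates the bus at four processors; read misses are not even counted in this estimate.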


    Write-Back Invalidate Protocol (MSI-Protocol)

    (for the Write-Back Strategy)

    - Three possible states for a cached memory block:

      M (modified): Only one processor holds the current version of the cache block; copies of the block in other caches and in main memory are invalid

      S (shared): The block is unmodified and copies in other caches may exist; all copies in other caches and in main memory are valid

      I (invalid): The block in that individual cache is invalid

    Idea:
    - Processor P wants to modify a cache block
    - All other copies of that cache block are marked invalid (I) (bus operation)
    - Modification of the cache block by P and marking it as modified (M)
    - Several memory operations on processor P are then possible without using the bus


    Possible Bus Operations for MSI:

    a) Bus Read (BusRd):

    - Triggered by a read operation on a memory location that is not located in the cache (read miss).

    - The cache controller requests the cache block by specifying the address.

    - The memory system supplies the block to the cache.

    b) Bus Read Exclusive (BusRdEx):

    - Triggered by a write operation to a memory location that is not cached or that is not loaded for modification (no (M) mark).

    - The cache controller requests the memory block as an exclusive copy by specifying the address of the block. (The request is serviced from global memory or from other caches.)

    - Other copies are marked (I)

    c) Write Back (BusWr):

    - Triggered by the replacement of a cache block, because another block is loaded.

    - The cache controller writes a block marked with (M) back into global memory.


    The processor performs read operations (PrRd) and write operations (PrWr); the cache controller performs the associated bus operations.

    [Figure: the processor issues PrRd and PrWr to the cache controller; the cache controller issues BusRd, BusRdEx and BusWr on the bus]


    State transitions of MSI

    for cache blocks (notation: observed operation / operation issued by the cache controller)

    state    transition         next state
    M        PrRd/-             M
    M        PrWr/-             M
    M        BusRd/flush        S
    M        BusRdEx/flush      I
    S        PrRd/-             S
    S        BusRd/-            S
    S        PrWr/BusRdEx       M
    S        BusRdEx/-          I
    I        PrRd/BusRd         S
    I        PrWr/BusRdEx       M

    flush = the cache controller puts the value requested by the memory operation on the bus

    Disadvantage: If a processor reads a value and writes to it afterwards, two bus operations are necessary, even if no other processor accesses the value.
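    The MSI transitions can be written down as a lookup table, sketched here in Python. The dictionary `MSI` and the function `run` are illustrative assumptions; the two entries for bus operations observed in state I (which leave the block invalid) are added for completeness.

```python
# (state, observed operation) -> (next state, operation put on the bus)
MSI = {
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
    ("M", "BusRd"): ("S", "flush"),
    ("M", "BusRdEx"): ("I", "flush"),
    ("S", "PrRd"): ("S", None),
    ("S", "BusRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdEx"),
    ("S", "BusRdEx"): ("I", None),
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdEx"),
    ("I", "BusRd"): ("I", None),      # assumed: nothing to do for an invalid block
    ("I", "BusRdEx"): ("I", None),    # assumed: nothing to do for an invalid block
}


def run(operations, state="I"):
    """Apply a sequence of operations to one cache block.

    Returns the final state and the list of bus operations issued.
    """
    bus = []
    for op in operations:
        state, action = MSI[(state, op)]
        if action is not None:
            bus.append(action)
    return state, bus


# A processor reads a value and writes to it afterwards.
print(run(["PrRd", "PrWr"]))
# Another processor then reads the modified block.
print(run(["BusRd"], state="M"))
```

    The first call shows the disadvantage named above: the read followed by a write needs both a BusRd and a BusRdEx.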


    MESI-Protocol

    (Variations in Intel Pentium, MIPS R4400, PowerPC)

    Introduction of an additional state (E) (exclusive)

    - Only the considered cache has a copy of the block and the global memory contains the current value.

    Procedure:

    - The reading processor marks a block with (E), if no other processor holds the block in its cache
    - A write operation performed later on that block by the processor:

      If the block is marked with (E): change to (M), no bus operation
      If the block is marked with (S): steps of the MSI protocol

    Write-Back-Update Protocol:

    After the update of a block marked with (M), all other caches holding that block are updated as well. -> Local caches always store the current value; increased bus traffic
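    The saving of the (E) state can be illustrated by counting bus operations for a read followed by a write. The function `bus_operations` is a simplified sketch that models only the local processor operations; whether another cache holds the block is passed in as the assumed flag `shared_elsewhere`.

```python
def bus_operations(operations, protocol, shared_elsewhere=False):
    """Return (final state, bus operations) for one block, starting in (I).

    operations:       sequence of "PrRd" / "PrWr"
    protocol:         "MSI" or "MESI"
    shared_elsewhere: True if another cache holds the block at the read miss
    """
    state, bus = "I", []
    for op in operations:
        if op == "PrRd":
            if state == "I":
                bus.append("BusRd")
                exclusive = protocol == "MESI" and not shared_elsewhere
                state = "E" if exclusive else "S"
        elif op == "PrWr":
            if state in ("I", "S"):
                bus.append("BusRdEx")   # other copies must be invalidated
            state = "M"                 # (E) -> (M) needs no bus operation
        else:
            raise ValueError(f"unknown operation {op!r}")
    return state, bus


print(bus_operations(["PrRd", "PrWr"], "MSI"))
print(bus_operations(["PrRd", "PrWr"], "MESI"))
print(bus_operations(["PrRd", "PrWr"], "MESI", shared_elsewhere=True))
```

    Only in the second case, where no other cache holds the block, does the write proceed without a second bus operation.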


    Cache-Coherence in Systems not using a Bus

    Cache coherency is not straightforward to realize, because no centralized medium exists.

    1. Approach: no Hardware-Cache Coherence

       e.g. Cray T3D, T3E (physically distributed, logically shared memory)

       Local caches can only store values from the local memory; data located in local memories of remote processors cannot be stored in the cache

       Advantage: no additional hardware required
       Disadvantage: Data accesses to non-local memories are expensive

    2. Approach: Directory-Protocol:
       A directory records the state of each memory block.


    An Example: (For a Directory Protocol)

    - Computer with a shared memory that is physically distributed and $p$ processors

    - For each local memory there is a table: for each cache block it is noted in which cache it is stored

    - Each cache block has a bit vector with $p$ presence bits and status bits

      Presence bit = 1: cache contains a valid copy
      Presence bit = 0: cache contains no valid copy

      Status bit (dirty bit) = 0: shared memory contains the current version
      Status bit (dirty bit) = 1: shared memory does not contain the current version

    - Management of each directory by a directory controller

    - Processors access memory locations using a cache controller


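    [Figure: two nodes, each with processor (P), cache (C), memory (M) and directory (Dir), connected by an interconnection network (VN)]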

    - A cache block in the local cache is marked with (M), (I) or (S).

    - On a cache read miss on processor $P_i$: Access the corresponding directory entry over the network

      a) dirty bit = 0:

      - The directory controller reads the memory block from global memory with a local access
      - The values are sent to the requesting cache controller using the network
      - presence[i] = 1 is set in the bit vector for the block

      b) dirty bit = 1:

      - The directory controller requests the block from processor j with presence[j] = 1
      - Cache controller j sends the block to i and to the directory controller of the memory block and changes (M) to (S)
      - The directory controller of the memory block sets dirty bit = 0 and presence[i] = 1.


    - On a cache write miss on processor $P_i$:

      a) dirty bit = 0:

      - The directory controller sends an invalidate message to all processors j with presence[j] = 1
      - After acknowledgement, the block is sent to i
      - presence[i] = 1, presence[j] = 0 for all j ≠ i, dirty = 1
      - On i the block is marked with (M)

      b) dirty bit = 1:

      - The block is sent from the processor j with presence[j] = 1 to i
      - presence[j] = 0, presence[i] = 1, dirty = 1 (remains)

    - The bit vectors are merged on a write back of the cache block
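    The bookkeeping for read and write misses can be sketched as a small Python class for a single memory block. The class `DirectoryEntry` and its attribute names are assumptions for illustration; network messages and acknowledgements are not modeled, only their effect on the presence bits, the dirty bit and the stored values.

```python
class DirectoryEntry:
    """Directory entry of one memory block in a system with p processors."""

    def __init__(self, p, value=0):
        self.presence = [0] * p        # presence bit per processor
        self.dirty = 0                 # 1: memory does not hold the current version
        self.memory = value            # value stored in the shared memory
        self.cache = [None] * p        # cached value per processor (None = invalid)

    def read_miss(self, i):
        if self.dirty:
            j = self.presence.index(1)       # the only cache with a valid copy
            self.memory = self.cache[j]      # j sends the block, (M) becomes (S)
            self.dirty = 0
        self.cache[i] = self.memory          # block is sent to processor i
        self.presence[i] = 1
        return self.cache[i]

    def write_miss(self, i, value):
        for j in range(len(self.presence)):  # invalidate all other copies
            if j != i:
                self.presence[j] = 0
                self.cache[j] = None
        self.presence[i] = 1
        self.dirty = 1
        self.cache[i] = value                # block on i is marked (M)


entry = DirectoryEntry(p=3, value=5)
entry.read_miss(0)
entry.read_miss(1)
entry.write_miss(2, 7)
print(entry.presence, entry.dirty, entry.memory)
print(entry.read_miss(0), entry.presence, entry.dirty, entry.memory)
```

    After the write miss only processor 2 is present and the memory still holds the old value; the following read miss of processor 0 fetches the block from processor 2 and clears the dirty bit.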


    Cache-Coherency/Cache-Consistency

    Cache-Coherency Problem: Two processors may have different values in their caches for the same memory location.

    Informal: A memory system is coherent, if each read operation on a variable returns the last written value.

    There are two different aspects of correct programming of systems with shared memory:

    I) Coherency: Which value is returned by a read operation?
    II) Consistency: When is a previously written value returned by a read operation?


    Coherency:

    1) Preservation of program order for a single processor for a variable x
    2) Coherent memory: A written value has to become visible to other processors (time delay)
    3) Write sequentialization of write operations of different processors to the same variable

    -> Coherency is maintained, but it is not specified when a written value becomes visible

    Coherency and Consistency are complementary:

    - Coherency deals with read and write accesses to the same memory location
    - Consistency describes the behavior of read and write operations with respect to different memory locations.

    Assumptions regarding Coherency:

    - A write operation is not finished before the effect of the write operation is visible to all other processors.

    - A processor does not change the order of write operations with respect to other memory operations

    -> Changing the order of read operations is possible
    -> Write operations have to be finished in program order


    Memory Consistency

    Memory-/Cache-Coherency:
    Each processor gets the same unique view of memory, i.e. each processor gets the same value at each time as other processors (if the program has such an operation).
    There is no statement about the order in which memory operations become visible.

    Memory consistency models deal with the question in which order the memory accesses of a processor are made visible to other processors (semantics, correctness)

    There are different design aspects for memory consistency models:

    a) Are the memory accesses of the processors performed in program order? (visibility, finished?)

    b) Are the memory operations made visible to other processors in the same order?


    Example:

    Three processors P1, P2, P3 with variables a, b, c initialized with zero. P1, P2, P3 execute a multiprocessor program that uses the shared variables a, b, c:

    Processor    P1                 P2                 P3
    Program      (1) a = 1;         (3) b = 1;         (5) c = 1;
                 (2) print b, c;    (4) print a, c;    (6) print a, b;

    Output of the multiprocessor program: 6 values, each of which can be 0 or 1

    Arbitrary order of instructions results in: $2^6 = 64$ possible outputs

    When using program order for each processor: the output 000000 is impossible

    A possible order is e.g.: (1)(2)(3)(4)(5)(6)

    Memory consistency models limit the possible execution orders.


    Sequential Consistency (Lamport79, SC)

    A multiprocessor system is sequentially consistent, if:

    - Each processor performs its memory operations in the program order of its program. The values become visible in that order.

    - The overall effect of all memory operations (of all processors) corresponds to an order which is achieved by interleaving all memory operations.

    Memory operations are assumed to be atomic, i.e. the effect of an operation becomes visible before another operation is performed by any other processor.

    Program order means the order with respect to the source code.

    -> Total order of all memory operations of a program (stronger than coherency, because all memory operations are affected).
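    For the three-processor example with the variables a, b, c, the outputs allowed by sequential consistency can be enumerated by brute force. The Python sketch below is an illustration under one assumption about the notation: an output such as 001011 is read as the two values printed by P1, followed by those of P2 and P3. The names `PROGRAM` and `sc_outputs` are chosen for this example.

```python
from itertools import permutations

# Processor -> (variable it sets to 1, variables it prints afterwards)
PROGRAM = {1: ("a", ("b", "c")), 2: ("b", ("a", "c")), 3: ("c", ("a", "b"))}


def sc_outputs():
    """Return all outputs reachable by interleavings that keep program order."""
    operations = [(p, kind) for p in PROGRAM for kind in ("write", "print")]
    outputs = set()
    for order in permutations(operations):
        # Program order: each processor writes before it prints.
        if any(order.index((p, "write")) > order.index((p, "print"))
               for p in PROGRAM):
            continue
        memory = {"a": 0, "b": 0, "c": 0}
        printed = {}
        for p, kind in order:
            variable, to_print = PROGRAM[p]
            if kind == "write":
                memory[variable] = 1          # atomic, immediately visible
            else:
                printed[p] = "".join(str(memory[v]) for v in to_print)
        outputs.add(printed[1] + printed[2] + printed[3])
    return outputs


outputs = sc_outputs()
for candidate in ("000000", "001011", "111111", "011001"):
    print(candidate, candidate in outputs)
```

    The enumeration confirms that 000000 and 011001 cannot occur under sequential consistency, whereas 001011 and 111111 can.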


    For the above example:

    Possible outputs of the SC-Model: 001011, 111111, but not 011001

    Coherency and Consistency are different:

    Example: Three processors P1, P2, P3 with the following program

    processor    P1             P2                      P3
    program      (1) A = 1;     (2) while (A == 0);     (4) while (B == 0);
                                (3) B = 1;              (5) print A;

    P2 waits until A gets the value 1 and sets B = 1 afterwards
    P3 waits until B gets the value 1 and prints A afterwards

    Output of the SC model: order (1), (2), (3), (4), (5), and A = 1

    With sequentialization of the write operations to each variable, but without atomicity:

    (3) can become visible on P3 before (1) -> Output A = 0

    Clarification by showing the execution on a parallel system


    Clarification:

    Parallel system: Cache coherency using the invalidate protocol on a directory basis

    Variable assignment at the beginning: A = 0, B = 0

    Cache state: the block with A and the block with B are stored with state (S) in the caches of P2 and P3.

    Execution Steps:
    - Operations are performed for each processor in program order
    - A memory operation is executed when the previous operation of the same processor has finished.

    No assumptions about run times of operations in the network


    Possible total execution order:

    a) P1 executes (1)
       -> Cache miss -> Directory access for A and invalidation messages to P2 and P3

    b) P2 executes (2) and has already received the invalidation message
       -> Read miss -> A copy of the current value A = 1 is sent, update in global memory

       P2 executes (3)
       -> Write miss because of state (S) -> Directory access for B and invalidation messages for P1 and P3

    c) P3 executes (4) and the invalidation message for B has arrived
       -> Read miss -> The current value of B is sent and the update in global memory is performed

       P3 executes (5) and has not yet received the invalidation message from P1
       -> A = 0 is read from the cache (old value)

    Sequential consistency is violated:
    P2 sees the order: A = 1; B = 1;
    P3 sees the order: B = 1; A = 1; (because the new value of B, but the old value of A is read)

    -> Sequentialization of memory accesses is not enough, atomicity must be added.
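    The effect of a delayed invalidation message can be replayed with a small Python model. It is a strong simplification and only a sketch: the function `run_example` updates the memory immediately on a write, keeps invalidation messages in a set `in_flight` until they are delivered explicitly, and does not model the directory itself.

```python
def run_example():
    """Replay steps a) to c); return the value P3 prints and undelivered messages."""
    memory = {"A": 0, "B": 0}
    cache = {2: {"A": 0, "B": 0}, 3: {"A": 0, "B": 0}}   # both blocks in state (S)
    in_flight = set()                  # invalidation messages: (target, variable)

    def write(writer, variable, value):
        memory[variable] = value
        if writer in cache:
            cache[writer][variable] = value
        for p in cache:                # invalidations are sent, not yet delivered
            if p != writer:
                in_flight.add((p, variable))

    def deliver(target, variable):
        in_flight.discard((target, variable))
        cache[target].pop(variable, None)       # copy becomes invalid

    def read(p, variable):
        if variable not in cache[p]:            # read miss: fetch current value
            cache[p][variable] = memory[variable]
        return cache[p][variable]

    write(1, "A", 1)                   # a) P1 executes (1)
    deliver(2, "A")                    # the message reaches P2 quickly ...
    assert read(2, "A") == 1           # b) P2 leaves the loop (2)
    write(2, "B", 1)                   #    P2 executes (3)
    deliver(3, "B")                    # the message for B reaches P3
    assert read(3, "B") == 1           # c) P3 leaves the loop (4)
    printed = read(3, "A")             #    P3 executes (5): hit on the old copy
    return printed, sorted(in_flight)


print(run_example())
```

    P3 prints the old value 0 because the invalidation message for A is still in flight when (5) is executed.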


    Sufficient Conditions for Sequential Consistency

    1) Each processor performs memory operations in program order (no out-of-order execution)

    2) After starting a write operation, the executing processor waits until that operation has finished, in particular until the (I) marks are set

    3) After starting a read operation, the executing processor waits until the read operation has finished and until the write operation whose value the read operation returns has become visible to all other processors.

    No assumptions are made about the cooperation of the processors, about the parameters of the network, or about the memory organization.

    Example:

    Condition 3) has the following effect:
    After reading A, P2 waits until the corresponding write operation (A = 1 by P1) has finished and performs the next memory operation (B = 1) afterwards
    -> P3 reads either old or new values for both A and B.

    Advantage: SC has a simple programming model
    Disadvantage: Atomicity can lead to inefficiency.


    Weaker Consistency Models

    The SC-Model requires the following execution orders for read and write operations of a processor:

    1) R -> R: read accesses in program order
    2) R -> W: read and write access in program order (anti-dependence, if same memory locations)
    3) W -> W: consecutive write accesses in program order (output dependence, if same memory locations)
    4) W -> R: write and read operation in program order (true dependence, if same memory locations)

    Weaker consistency models remove some of the strict orders of SC.


    Processor-Consistency

    Removes the requirement for the W -> R order, i.e. a read operation R can be executed before a preceding write operation W has finished, if there is no data dependency.

    TSO (total store ordering) model: SPARC processor from Sun
    PC (processor consistency) model: Intel Pentium processor
    (two terms for the same principle)

    Example:

    Variables A, B are initialized with 0:

    Processor    P1               P2
    Program      (1) A = 1;       (3) B = 1;
                 (2) print B;     (4) print A;

    SC-Model: always (1) before (2) and (3) before (4) -> Output A = 0 and B = 0 is not possible

    TSO-Model: Output A = 0 and B = 0 is possible, because the write operations (1) and (3) need not be finished before the read operations (2) and (4) are executed.

    Note:

    - Correct behavior for synchronous programs

    - Hiding of write latencies is possible
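    The difference between SC and TSO for this example can be explored with a small enumeration in Python. The model is an assumption made for illustration: each write is split into issuing it ("write", value goes into a store buffer) and making it visible in memory ("commit"); for SC the commit directly follows the write. The function name `outcomes` is hypothetical.

```python
from itertools import permutations


def outcomes(store_buffer):
    """Return all pairs (B printed by P1, A printed by P2).

    store_buffer=False models SC, store_buffer=True models TSO.
    """
    events = [(p, e) for p in (1, 2) for e in ("write", "read", "commit")]
    results = set()
    for order in permutations(events):
        position = {event: k for k, event in enumerate(order)}
        valid = True
        for p in (1, 2):
            # Program order: the write is issued before the read;
            # a write can only become visible after it was issued.
            if position[(p, "write")] > position[(p, "read")]:
                valid = False
            if position[(p, "write")] > position[(p, "commit")]:
                valid = False
            # SC: the write is visible before anything else happens.
            if not store_buffer and position[(p, "commit")] != position[(p, "write")] + 1:
                valid = False
        if not valid:
            continue
        memory = {"A": 0, "B": 0}
        printed = {}
        for p, event in order:
            own, other = ("A", "B") if p == 1 else ("B", "A")
            if event == "commit":
                memory[own] = 1
            elif event == "read":
                printed[p] = memory[other]
        results.add((printed[1], printed[2]))
    return results


print(sorted(outcomes(store_buffer=False)))   # SC
print(sorted(outcomes(store_buffer=True)))    # TSO
```

    Only the model with store buffers contains the outcome (0, 0), i.e. both processors print 0.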


    Partial-Store-Ordering (PSO)

    Removes the requirements for W -> R and W -> W
    -> Write operations can become visible in an order different from program order
    -> Overlapping of write operations is possible (advantages on write misses)

    Example:

    Variables A and flag are initialized with 0

    Processor    P1                P2
    Program      (1) A = 1;        (3) while (flag == 0);
                 (2) flag = 1;     (4) print A;

    SC-Model: Output A = 0 is not possible
    TSO-Model: Output A = 0 is not possible
    PSO-Model: flag = 1 can be finished before A = 1 -> Output A = 0 is possible
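    The example can be checked with a short Python sketch that enumerates the orders in which the two writes of P1 may become visible. The function `printed_values` is a simplified model assumed for illustration: under SC and TSO the writes become visible in program order, under PSO in any order, and P2 may leave its loop and print A at any moment at which flag = 1 is visible.

```python
from itertools import permutations


def printed_values(model):
    """Return the set of values P2 can print for A under the given model."""
    stores = ("A", "flag")                       # program order on P1
    if model in ("SC", "TSO"):
        visible_orders = [stores]                # W -> W order is kept
    elif model == "PSO":
        visible_orders = list(permutations(stores))
    else:
        raise ValueError(f"unknown model {model!r}")

    results = set()
    for order in visible_orders:
        memory = {"A": 0, "flag": 0}
        for variable in order:
            memory[variable] = 1                 # this write becomes visible
            if memory["flag"] == 1:              # P2 may leave the loop now
                results.add(memory["A"])         # and print the current A
    return results


for model in ("SC", "TSO", "PSO"):
    print(model, sorted(printed_values(model)))
```

    Only for PSO does the result contain 0, since flag = 1 can become visible before A = 1.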


    Weak-Ordering

    Additionally removes the requirements R -> R and R -> W
    -> There is no guarantee on any order

    There are additional synchronization operations:

    a) All read and write operations performed before the synchronization operation (in program order) are finished before the synchronization operation is executed

    b) A synchronization operation is completed before any read or write operations following the synchronization operation are executed
