
Multi-Core Programming - Caches and Memory Hierarchies

Prof. Dr. Gudula Rünger

Professur Praktische Informatik, Technische Universität Chemnitz

Winter Term 2013/2014

Outline

1 Caches
- Cache Associativity
- Block Replacement Strategies
- Write Back Strategies

2 Cache Coherency
- Bus-Snooping
- Write-Back Invalidate Protocol (MSI-Protocol)
- MESI Protocol
- Cache-Coherence in Systems not using a Bus
- Multiprocessor-Cache-Coherency

3 Memory Consistency
- Sequential Consistency
- Weaker Consistency Models


Key characteristics of Hardware Development

Growth of DRAM capacity (60% per year) (1980-98)

Growth of processor performance (55-80% per year)

Reduction of DRAM access time approx. 25 % per year

Introduction of caches and memory hierarchies between processor and main memory


Cache Memory

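SRAM (Static Random Access Memory) chips with access times of 0.5 ns to 5 ns, instead of 45 ns to 70 ns when using DRAM (ns = nanosecond = $10^{-9}$ s)

Reload strategy: data blocks (cache blocks, cache lines) of constant size are loaded from the main memory into the cache

[Figure: main memory (HS) - cache (C) - processor (P); blocks are transferred between main memory and cache, words between cache and processor]

Data is accessed via the cache:

Cache hit or cache miss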




Locality of Memory Accesses

Spatial Locality

Accesses to neighboring memory locations of main memory at consecutive points in time during program execution

Some of the successive memory accesses will access the same cache line, so no further reloading from memory is necessary (the use of cache lines comprising several memory words is based on the assumption that most programs exhibit spatial locality).

Temporal Locality

The same memory location is accessed multiple times at consecutive points in time during program execution

After loading, the memory words of the cache block are accessed multiple times before the cache block is replaced again.


Cache Characteristics

Cache line size

The size of the block of consecutive memory locations which are loaded as a whole. (Idea: spatial locality of the program is assumed, i.e. an access to a memory location at time $t$ is followed by an access to a neighboring memory location at time $t+1$.) Typical size: 4-8 words (32-64 bytes)

Cache size

Number of bytes a cache is able to store. Typical size: 8-128 memory words

Cache associativity

Mapping of memory blocks to cache lines. The associativity determines at how many positions in the cache a memory block can be stored.

a) Direct-mapped cache: each memory block can be stored at exactly one position in the cache

b) Fully associative cache: each memory block can be stored at an arbitrary location in the cache

c) Set-associative cache: each memory block can be stored at a fixed number of positions in the cache





Cache Associativity

Main memory: $n = 2^s$ blocks $B_j$, $j = 0, \dots, n-1$
Cache memory: $m = 2^r$ cache lines $\bar{B}_i$, $i = 0, \dots, m-1$

Memory block, cache block: $l = 2^w$ memory words

The tag for identifying a cached memory block depends on the cache method




Direct Mapped Cache

Exactly one cache line for each memory block.

E.g. $B_j \mapsto \bar{B}_i$ with $i = j \bmod m$, i.e.

cache block    memory blocks
0              $0, m, 2m, \dots, 2^s - m$
1              $1, m+1, \dots, 2^s - m + 1$
...            ...
$m-1$          $m-1, \dots, 2^s - 1$

There are $n/m = 2^{s-r}$ memory blocks for each cache block.

Access to a memory location using the address of the value:

memory address = block address + word address

block address: address of the memory block, $s$ bits (identifies $2^s$ blocks)
word address: relative address in the memory block, $w$ bits (identifies $2^w$ words)

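[Figure: the block address is split into a tag of $s-r$ bits and the address of the cache block of $r$ bits, followed by the word address of $w$ bits; memory blocks with different tags map to the same cache block]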

Memory access (steps):

- Identify the cache block (the $r$ rightmost bits of the $s$-bit block address, i.e. of the leftmost $s$ bits of the memory address)

- Compare the tag stored for that cache block with the $s-r$ leftmost bits of the memory address:

match → cache hit; no match → cache miss, reloading

Disadvantage:

- Only one position per memory block in the cache. Memory blocks that map to the same cache position may be loaded and replaced continually (thrashing)
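The access steps and the thrashing effect can be illustrated with a minimal sketch. The names `split_address` and `DirectMappedCache` are hypothetical helpers chosen for this illustration; the sketch only tracks tags, not the cached data, and assumes word-addressed memory with the bit widths $r$ and $w$ from above.

```python
def split_address(addr, r, w):
    """Split a memory address into (tag, cache block index, word address)."""
    word = addr & ((1 << w) - 1)        # w rightmost bits: word within the block
    block = addr >> w                   # s-bit block address j
    index = block & ((1 << r) - 1)      # i = j mod m with m = 2**r
    tag = block >> r                    # s - r leftmost bits
    return tag, index, word


class DirectMappedCache:
    """Tag store of a direct-mapped cache with 2**r blocks of 2**w words."""

    def __init__(self, r, w):
        self.r, self.w = r, w
        self.tags = [None] * (1 << r)   # one tag per cache block
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        tag, index, _ = split_address(addr, self.r, self.w)
        if self.tags[index] == tag:     # tags match: cache hit
            self.hits += 1
            return True
        self.tags[index] = tag          # cache miss: reload and replace
        self.misses += 1
        return False


cache = DirectMappedCache(r=3, w=2)     # m = 8 cache blocks of 4 words each
# Memory blocks 1 and 9 both map to cache block 1 (9 mod 8 == 1),
# so alternating accesses evict each other every time.
for _ in range(4):
    cache.access(1 << 2)
    cache.access(9 << 2)
print(cache.hits, cache.misses)
```

In this trace every access is a miss although only two memory blocks are used and seven cache blocks stay empty.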


Fully-Associative Cache

Each memory block can be placed in any cache position. (This overcomes the disadvantage of direct mapped caches.)

memory address = block address ($s$ bits) + word address ($w$ bits), i.e. $2^s$ blocks of $2^w$ words

The whole block address ($s$ bits) is used as tag.

Memory access: all tag entries in the cache must be searched.

Advantage: increased flexibility when loading memory blocks into the cache

Disadvantage: expensive search (complex circuit or time delays)

Example:

Cache of size 64 KByte, cache block size 4 Byte ($w = 2$) → $16\,\mathrm{K} = 2^{14}$ blocks within the cache ($r = 14$)

Size of main memory 16 MByte = $2^{24}$ Byte → $2^{22}$ blocks ($s = 22$)

→ Tags are 22 bits: for each 32-bit block (4 Byte) a 22-bit tag has to be stored.

In comparison, a direct-mapped cache: $r = 14$ → $s - r = 22 - 14 = 8$, i.e. tags of size 8 bits (and the cache blocks need not be searched)
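The arithmetic of this example can be re-checked with a few lines. The function `tag_bits` is a hypothetical helper; it assumes byte-addressed memory (as in the example, where a 4-byte block gives $w = 2$) and sizes that are powers of two.

```python
def log2_exact(x):
    """Return log2(x) for a power of two x."""
    assert x > 0 and x & (x - 1) == 0, "size must be a power of two"
    return x.bit_length() - 1


def tag_bits(cache_bytes, block_bytes, memory_bytes):
    w = log2_exact(block_bytes)                  # bits of the word address
    r = log2_exact(cache_bytes // block_bytes)   # 2**r blocks in the cache
    s = log2_exact(memory_bytes // block_bytes)  # 2**s blocks in main memory
    return {
        "w": w,
        "r": r,
        "s": s,
        "fully_associative_tag": s,      # whole block address is the tag
        "direct_mapped_tag": s - r,      # only the s - r leftmost bits
    }


sizes = tag_bits(cache_bytes=64 * 1024, block_bytes=4,
                 memory_bytes=16 * 1024 * 1024)
print(sizes)
```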


Set-Associative Cache

The cache is partitioned into $v$ sets $S_0, \dots, S_{v-1}$ ($v = 2^d$); each set consists of $k = m/v$ cache blocks.

Mapping of memory blocks to sets:

$B_j \mapsto S_i$ with $i = j \bmod v$, i.e.

set      memory blocks
0        $0, v, 2v, \dots, 2^s - v$
1        $1, v+1, \dots, 2^s - v + 1$
...      ...
$v-1$    $v-1, \dots, 2^s - 1$

A memory block can be placed in any cache block of its set.

memory address = block address ($s$ bits) + word address ($w$ bits)

The block address is split into a tag of $s-d$ bits and the number of the set $S_i$ of $d$ bits.



[Figure: example of a set-associative cache with the sets $S_0$ and $S_1$ of 4 cache blocks each; the memory block addresses are split into tag and set number]

Memory access (steps):

- Identify the set $S_i$ for block $B_j$

- Compare the $s-d$ leftmost bits of the memory address with the tags of the cache blocks within the set $S_i$

Special cases:

1) $v = m$ and $k = 1$: set-associative cache = direct mapped cache

2) $v = 1$ and $k = m$: set-associative cache = fully associative cache

3) $v = m/2$ and $k = 2$: 2-way set-associative

4) $v = m/4$ and $k = 4$: 4-way set-associative
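The set mapping and the special cases can be sketched as follows. Both function names are hypothetical and only serve this illustration; `m`, `k`, `v` and `d` are the quantities defined above.

```python
def set_index_and_tag(j, d):
    """Map memory block j to (set number i = j mod v, tag) with v = 2**d sets."""
    v = 1 << d
    return j % v, j >> d


def organisation(m, k):
    """Classify a cache with m blocks and k blocks per set; returns (v, name)."""
    v = m // k
    if k == 1:
        return v, "direct mapped"
    if v == 1:
        return v, "fully associative"
    return v, f"{k}-way set-associative"


print(set_index_and_tag(13, d=2))       # block 13 with v = 4 sets
for k in (1, 2, 4, 8):
    print(k, organisation(m=8, k=k))
```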








Block Replacement Strategies

a) Direct-mapped cache: there is only one possible cache position, whose content is replaced by the new block.

b), c) Fully-associative cache or set-associative cache:

- LRU replacement strategy (Least Recently Used): the block that has not been accessed for the longest time is replaced.

For a two-way set-associative cache an additional USE bit per set is sufficient, i.e. set USE bit = 1 if the first cache block of the set was referenced and USE bit = 0 if the second cache block of the set was referenced

- LFU replacement strategy (Least Frequently Used): the block with the smallest number of accesses is replaced.

- Random
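A minimal sketch of LRU replacement in a set-associative cache, assuming block-granular accesses and tracking tags only. The class name `SetAssociativeLRU` is hypothetical; instead of a USE bit it keeps each set in an `OrderedDict` ordered from least to most recently used, which generalizes to any `k`.

```python
from collections import OrderedDict


class SetAssociativeLRU:
    """Cache with v sets of k blocks each and LRU replacement within a set."""

    def __init__(self, v, k):
        self.v, self.k = v, k
        self.sets = [OrderedDict() for _ in range(v)]

    def access(self, j):
        """Access memory block j; return True on a cache hit."""
        blocks = self.sets[j % self.v]       # set S_i with i = j mod v
        tag = j // self.v
        if tag in blocks:
            blocks.move_to_end(tag)          # now the most recently used block
            return True
        if len(blocks) == self.k:
            blocks.popitem(last=False)       # evict the least recently used block
        blocks[tag] = True
        return False


cache = SetAssociativeLRU(v=1, k=2)          # a single two-way set
trace = [0, 1, 0, 2, 1]
results = [cache.access(j) for j in trace]
print(results)
```

In the trace, the access to block 2 evicts block 1 (block 0 was used more recently), so the final access to block 1 misses again.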



Write Back Strategies

Write operation on cache hit:

Write operations change values in the cache;

Reading correct values on read access must be ensured.

The modified value has to be updated in main memory, too;

Write-back strategies determine the time and the manner of updating the main memory:

- Write-through strategy
- Write-back strategy

Write-through Strategy

After a write operation in the cache has been performed, the corresponding memory operation has to be performed in main memory immediately (write-through).

→ Cache blocks and corresponding memory blocks always have the same value
→ I/O peripherals or additional processors always have a recent view of the memory.

Problem: the write operation to main memory takes a long time → write stall for the processor
Remedy: usage of write buffers


Write-back Strategy

Write operations are at first performed only in the cache. When the modified cache block is replaced, the corresponding memory block is updated (write back).

→ Cache block and corresponding memory block possibly contain different values.

Realization of the write-back strategy:

A dirty bit records whether a cache block was modified.

Only modified cache blocks must be written back.

Advantage: fewer main memory operations

Disadvantage: main memory may contain invalid values → I/O peripherals cannot access the main memory directly but have to access the cache

Write operations on a cache miss:

a) Write allocate (fetch on write): the memory block is fetched from main memory and the write is treated like a cache hit afterwards.

b) No-write allocate (write around): on a cache miss the memory block is modified only in main memory.
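The difference between the two strategies can be made concrete with a toy model. `OneBlockCache` is a hypothetical class with a single cache block, write allocate, and a dictionary standing in for main memory; it only counts how often main memory is written.

```python
class OneBlockCache:
    """A cache with one block in front of a dict that stands in for main memory."""

    def __init__(self, memory, write_back):
        self.memory = memory
        self.write_back = write_back
        self.addr = None           # address of the cached block
        self.value = None
        self.dirty = False         # dirty bit, only used with write-back
        self.memory_writes = 0

    def _load(self, addr):
        if self.addr != addr:      # miss: replace the cached block
            if self.write_back and self.dirty:
                self.memory[self.addr] = self.value   # write back on replacement
                self.memory_writes += 1
            self.dirty = False
            self.addr, self.value = addr, self.memory[addr]

    def read(self, addr):
        self._load(addr)
        return self.value

    def write(self, addr, value):
        self._load(addr)           # write allocate
        self.value = value
        if self.write_back:
            self.dirty = True      # main memory is updated later
        else:
            self.memory[addr] = value   # write-through: update immediately
            self.memory_writes += 1


def run(write_back):
    memory = {0: 0, 1: 0}
    cache = OneBlockCache(memory, write_back)
    for value in range(1, 6):
        cache.write(0, value)      # five writes to the same block
    before_replacement = memory[0]
    cache.read(1)                  # forces replacement of block 0
    return cache.memory_writes, before_replacement, memory[0]


print("write-through:", run(write_back=False))
print("write-back:   ", run(write_back=True))
```

Each result is (number of main memory writes, value in main memory before the replacement, value afterwards): write-back needs one memory write instead of five, but main memory holds a stale value until the block is replaced.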







Cache Coherency

Situation: Multiprocessor with a local cache for each processor

Memory Coherency Problem

The same memory block is potentially located in different local caches at the same time. After a modification, local caches and the main memory may contain different, inconsistent values for the same memory location.




Example for Cache Coherency

- Bus-based SMP system with processors $P_i$ and local caches $C_i$, $i = 1, 2, 3$

- Write-through strategy

- Variable $u$ located in memory $M$ has value 5

time   action                  effect
t1     P1 reads variable u     block with u is loaded into C1
t2     P3 reads variable u     block with u is loaded into C3
t3     P3 writes 7 into u      the value in memory M is changed
t4     P1 reads variable u     P1 gets value 5, because u is cached
                               (with a write-back strategy, P1 reads 5 as well)
t5     P2 reads variable u     P2 reads the value 7 from M
                               (with a write-back strategy, P2 reads the value 5)
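The example can be replayed with a small simulation of three private caches without any coherence mechanism. The function `run_example` and its dictionaries are hypothetical stand-ins for the caches $C_1, C_2, C_3$ and the memory $M$; with write-back, the modified block is assumed to stay in $C_3$ during the example.

```python
def run_example(write_back):
    memory = {"u": 5}
    caches = {1: {}, 2: {}, 3: {}}      # local caches C1, C2, C3

    def read(p):
        if "u" not in caches[p]:        # miss: load the block from memory
            caches[p]["u"] = memory["u"]
        return caches[p]["u"]           # hit: possibly a stale value

    def write(p, value):
        caches[p]["u"] = value
        if not write_back:              # write-through updates memory at once
            memory["u"] = value

    read(1)                             # t1: P1 loads u into C1
    read(3)                             # t2: P3 loads u into C3
    write(3, 7)                         # t3: P3 writes 7 into u
    value_p1 = read(1)                  # t4: P1 hits in its own cache
    value_p2 = read(2)                  # t5: P2 misses and reads memory
    return value_p1, value_p2


print("write-through:", run_example(write_back=False))
print("write-back:   ", run_example(write_back=True))
```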


Coherency of Memory

Describes the behavior of read and write accesses to the same memory location from several processors of a multiprocessor system.
(Coherency: context; logical, argumentative coherence)

Informal: A memory system is coherent if for each memory location a read access returns the last value written.

Which value is the last value written in a multiprocessor system?

(Processors may write to a memory location at the same time)

Clarifying the Notion of Coherence

- As a measure of time, not the time of a physical read or write is used, but the position of the operation in the program order.

- Without local caches the memory system is coherent, because the main memory leads to a sequentialization of the memory accesses to a memory location and the memory accesses are done in program order. (write sequentialization)


Notion of Coherence

A memory system is coherent if the following conditions are true:

(1) If processor $P$ writes to memory location $x$ at time $t_1$ and reads $x$ at time $t_2 > t_1$, and if no other processor writes to $x$ in the interval between $t_1$ and $t_2$, then $P$ reads at time $t_2$ the value written at time $t_1$.
(i.e. for each processor of the parallel system the program order is maintained)
(Counterexample: write-back strategy; at time $t_0 < t_1$ a processor writes to $x$, and the cache block is then replaced at a time later than $t_1$)

(2) If processor $P_1$ writes to memory location $x$ at time $t_1$ and processor $P_2$ reads memory location $x$ at time $t_2$, and if no other processor writes to $x$ in the interval between $t_1$ and $t_2$ and the interval $t_2 - t_1$ is long enough, then $P_2$ reads the value written by $P_1$
(after a certain amount of time the new value is visible)

(3) If any two processors write to the same memory location $x$, then the write accesses are sequentialized, i.e. all processors observe the write accesses in the same order. (global sequentialization)



Bus-Snooping

- Possible realization of cache coherency in a bus-based SMP system with write-through strategy

- All memory accesses are handled by a centralized bus

- Cache controllers of all processors observe the bus, i.e. all write operations to global memory are recognized

- Values of memory locations located in the local cache are updated

→ Local caches contain up-to-date values

(In the example above: P1 observes the write operation from P3)

Disadvantage: enormous write traffic on the bus, because of write-through

Example:

Bus system with a 200 MHz processor which requires one cycle per instruction (i.e. 200 million instructions per second)
15% write operations → 30 million write operations per second
8 Byte per write operation (cache line size) → 240 MB/s
→ A bus with 1 GB/s bandwidth can handle 4 processors

→ A write-back strategy with a suitable protocol is needed
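The bandwidth estimate can be re-derived step by step. The variable names are chosen for this sketch; 1 MB is taken as $10^6$ bytes and 1 GB as $10^9$ bytes, which is an assumption about the units used on the slide.

```python
instructions_per_second = 200_000_000     # 200 MHz, one cycle per instruction
write_percent = 15                        # share of write operations
bytes_per_write = 8                       # cache line size in the example
bus_bandwidth = 1_000_000_000             # 1 GB/s

writes_per_second = instructions_per_second * write_percent // 100
traffic_per_processor = writes_per_second * bytes_per_write   # bytes per second
max_processors = bus_bandwidth // traffic_per_processor

print(writes_per_second, traffic_per_processor, max_processors)
```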








Write-Back Invalidate Protocol (MSI-Protocol)

(for the Write-Back Strategy)

- Three possible states for a cached memory block:

M (modified): only one processor holds the current version of the cache block; copies of the block in other caches and in main memory are invalid

S (shared): the block is unmodified and copies may exist in other caches; all copies in other caches and in main memory are valid

I (invalid): the block in that individual cache is invalid

Idea:
- Processor P wants to modify a cache block
- All other copies of that cache block are marked invalid (I) (bus operation)
- P modifies the cache block and marks it as modified (M)
- Several memory operations on processor P are possible without using the bus


Possible Bus Operations for MSI:

a) Bus Read (BusRd):

- Triggered by a read operation on a memory location that is not located in the cache (read miss).

- The cache controller requests the cache block by specifying the address.
- The memory system supplies the block to the cache.

b) Bus Read Exclusive (BusRdEx):

- Triggered by a write operation to a memory location that is not cached or that is not loaded for modification (no (M) mark).

- The cache controller requests the memory block as an exclusive copy by specifying the address of the block. (The request is serviced from global memory or from other caches.)

- Other copies are marked (I)

c) Write Back (BusWr):

- Triggered by the replacement of a cache block, because another block is loaded.

- The cache controller writes a block marked with (M) back into global memory.



The processor performs read operations (PrRd) and write operations (PrWr); the cache controller performs the associated bus operations.

[Figure: processor - (PrRd, PrWr) - cache controller - (BusRd, BusRdEx, BusWr) - bus]


State transitions of MSI

for cache blocks (observed operation / cache controller operation)

state   observed operation   cache controller operation   new state
I       PrRd                 BusRd                        S
I       PrWr                 BusRdEx                      M
S       PrRd                 -                            S
S       BusRd                -                            S
S       PrWr                 BusRdEx                      M
S       BusRdEx              -                            I
M       PrRd, PrWr           -                            M
M       BusRd                flush                        S
M       BusRdEx              flush                        I

flush = the cache controller puts the value requested by the memory operation on the bus

Disadvantage: If a processor reads a value and writes to it afterwards, two bus operations are necessary, even if no other processor accesses the value.
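The MSI transitions can be written as a lookup table and driven by a small loop. This is a sketch for a single memory block under simplifying assumptions: the table `MSI` and the function `run` are hypothetical names, the bus is modelled as an immediate broadcast to all other caches, and replacement (BusWr) is not modelled.

```python
# (state, observed operation) -> (new state, operation issued by the controller)
MSI = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdEx"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdEx"),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdEx"): ("I", None),
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
    ("M", "BusRd"): ("S", "flush"),
    ("M", "BusRdEx"): ("I", "flush"),
}


def run(states, accesses):
    """Apply processor accesses (p, 'PrRd' or 'PrWr') to the per-cache states.

    Returns the list of (cache, bus operation) pairs that appeared on the bus.
    """
    bus_log = []
    for p, op in accesses:
        states[p], action = MSI[(states[p], op)]
        if action is None:
            continue                       # served locally, no bus traffic
        bus_log.append((p, action))
        for q in range(len(states)):       # all other controllers snoop
            if q != p:
                # a block in state I ignores bus operations
                states[q], reaction = MSI.get((states[q], action),
                                              (states[q], None))
                if reaction:
                    bus_log.append((q, reaction))
    return bus_log


states = ["I", "I"]
log = run(states, [(0, "PrRd"), (0, "PrWr"), (1, "PrRd")])
print(log, states)
```

The first two entries of the log show the disadvantage named above: the read followed by a write of processor 0 causes a BusRd and a BusRdEx although cache 1 never held the block.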








MESI-Protocol

(Variations in Intel Pentium, MIPS R4400, PowerPC)

Introduction of an additional state (E) (exclusive):

- Only the considered cache has a copy of the block and the global memory contains the current value.

Procedure:

- The reading processor marks a block with (E) if no other processor holds the block in its cache
- A write operation performed later on that block by the processor:

If the block is marked with (E) → change to (M), no bus operation
If the block is marked with (S) → steps of the MSI protocol
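The saving compared with MSI can be counted for the read-then-write pattern. The function below is a hypothetical sketch that only lists the bus operations issued by the accessing processor; whether other caches hold the block is passed in as a flag.

```python
def bus_ops_read_then_write(protocol, other_caches_hold_block):
    """Bus operations issued when a processor reads an uncached block and
    writes to it afterwards."""
    ops = ["BusRd"]                       # read miss in both protocols
    if protocol == "MESI" and not other_caches_hold_block:
        state = "E"                       # exclusive, memory is up to date
    else:
        state = "S"
    if state == "S":
        ops.append("BusRdEx")             # invalidate other copies (MSI steps)
    # state "E": silent change to (M), no bus operation
    return ops


print(bus_ops_read_then_write("MSI", False))
print(bus_ops_read_then_write("MESI", False))
print(bus_ops_read_then_write("MESI", True))
```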

Write-Back-Update Protocol:

After an update of a block marked with (M), all other caches holding that block are updated as well. → Local caches always store the current value; increased bus traffic









Cache-Coherence in Systems not using a Bus

Cache coherency is not straightforward to realize, because no centralized medium exists.

1. Approach: no hardware cache coherence

e.g. Cray T3D, T3E (physically distributed, logically shared memory)

Local caches can only store values from the local memory; data located in local memories of remote processors cannot be stored in the cache

Advantage: no additional hardware required
Disadvantage: data accesses to non-local memories are expensive

2. Approach: directory protocol

A directory records the state of each memory block.


An Example (for a Directory Protocol)

- Computer with a shared memory that is physically distributed, and $p$ processors

- For each local memory there is a table: for each cache block it is noted in which caches it is stored

- Each cache block has a bit vector with $p$ presence bits and status bits

Presence bit = 1: cache contains a valid copy
Presence bit = 0: cache contains an invalid copy

Status bit (dirty bit) = 0: shared memory contains the current version
Status bit (dirty bit) = 1: shared memory does not contain the current version

- Management of each directory by a directory controller

- Processors access memory locations using a cache controller


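[Figure: nodes consisting of processor (P), cache (C), memory (M) and directory (Dir), connected by an interconnection network (VN)]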

- A cache block in the local cache is marked with (M), (I) or (S).

- On a cache read miss on processor $P_i$: access the corresponding directory entry over the network

a) dirty bit = 0:

- The directory controller reads the memory block from global memory with a local access
- It sends the values to the requesting cache controller using the network
- It sets presence[i] = 1 in the bit vector for the block

b) dirty bit = 1:

- The directory controller requests the block from the processor j with presence[j] = 1
- Cache controller j sends the block to i and to the directory controller of the memory block and changes (M) to (S)
- The directory controller of the memory block sets dirty bit = 0 and presence[i] = 1.

- On a cache write miss on processor $P_i$:

a) dirty bit = 0:

- The directory controller sends an invalidate message to all processors j with presence[j] = 1
- After acknowledgement, the block is sent to i
- presence[i] = 1, presence[j] = 0 for j ≠ i, dirty = 1
- On i the block is marked with (M)

b) dirty bit = 1:

- The block is sent from the processor j with presence[j] = 1 to i
- presence[j] = 0, presence[i] = 1, dirty = 1 (remains)

- The bit vectors are merged on a write-back of the cache block
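The bookkeeping for read and write misses can be sketched for one memory block. `DirectoryEntry` is a hypothetical class; network messages and acknowledgements are collapsed into direct assignments, and write-back on replacement is left out.

```python
class DirectoryEntry:
    """Directory entry of one memory block shared by p processors."""

    def __init__(self, p, value):
        self.memory = value          # value held in the shared memory
        self.presence = [0] * p      # presence bits
        self.dirty = 0               # dirty bit
        self.state = ["I"] * p       # state of the block in each local cache
        self.copy = [None] * p       # value of the block in each local cache

    def read_miss(self, i):
        if self.dirty:
            j = self.presence.index(1)       # the cache holding the (M) copy
            self.memory = self.copy[j]       # j sends the block to the directory
            self.state[j] = "S"              # (M) becomes (S)
            self.dirty = 0
        self.copy[i] = self.memory           # block is sent to i
        self.state[i] = "S"
        self.presence[i] = 1

    def write_miss(self, i, value):
        for j in range(len(self.presence)):
            if j != i and self.presence[j]:  # invalidate all other copies
                self.presence[j] = 0
                self.state[j] = "I"
                self.copy[j] = None
        self.copy[i] = value                 # i holds the only valid copy
        self.state[i] = "M"
        self.presence[i] = 1
        self.dirty = 1


entry = DirectoryEntry(p=3, value=5)
entry.read_miss(0)
entry.read_miss(2)
entry.write_miss(2, 7)
after_write = (list(entry.presence), entry.dirty, list(entry.state), entry.memory)
entry.read_miss(0)
after_read = (list(entry.presence), entry.dirty, list(entry.state), entry.memory)
print(after_write)
print(after_read)
```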








Cache-Coherency/Cache-Consistency

Cache-Coherency Problem: Two processors may have different values in their caches for the same memory location.

Informal: A memory system is coherent if each read operation of a variable returns the last written value.

There are two different aspects of correct programming of systems with shared memory:

I) Coherency: Which value is returned by a read operation?
II) Consistency: When is a previously written value read by a read operation?


Coherency:

1) Preservation of program order for a single processor for a variable $x$
2) Coherent memory: a written value has to become visible to other processors (time delay)
3) Write sequentialization of write operations of different processors to the same variable

Coherency is maintained, but it is not specified when a written value becomes visible.

Coherency and Consistency are complementary:

- Coherency deals with read and write accesses to the same memory location
- Consistency describes the behavior of read and write operations with respect to different memory locations.

Assumptions regarding Coherency:

- A write operation is not finished before the effect of the write operation is visible to all other processors.

- A processor does not change the order of its write operations with respect to write operations from other processors

→ Changing the order of read operations is possible
→ Write operations have to be finished in program order









Memory Consistency

Memory/Cache Coherency:

Each processor gets the same unique view of memory, i.e. each processor gets the same value at each time as the other processors (if the program has such an operation).
There is no statement about the order in which memory operations become visible.

Memory consistency models deal with the order in which the memory accesses of a processor are made visible to other processors (semantics, correctness)

There are different design aspects for memory consistency models:

a) Are the memory accesses of the processors performed in program order? (visibility, finished?)

b) Are the memory operations made visible to other processors in the same order?


Example:

Three processors P1, P2, P3 execute a multiprocessor program that uses the shared variables a, b, c, all initialized with zero:

Processor   P1                P2                P3
Program     (1) a = 1;        (3) b = 1;        (5) c = 1;
            (2) print b, c;   (4) print a, c;   (6) print a, b;

Output of the multiprocessor program: 6 values, each of which can be 0 or 1

Arbitrary order of instructions results in $2^6 = 64$ possible outputs

When using program order for each processor: the output 000000 is impossible

A possible order is e.g. (1)(2)(3)(4)(5)(6)

Memory consistency models limit the possible execution orders.
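The outputs that remain possible when every processor keeps its program order can be enumerated by brute force. The sketch below interleaves the six operations in all ways that respect program order and executes each operation atomically. One assumption is made about the output format: the six digits are written as the values printed by P1, then P2, then P3. The names `WRITES`, `PRINTS` and `interleaved_outputs` are chosen for this illustration.

```python
from itertools import permutations

WRITES = ["a", "b", "c"]                        # variable written by P1, P2, P3
PRINTS = [("b", "c"), ("a", "c"), ("a", "b")]   # variables printed by P1, P2, P3


def interleaved_outputs():
    """Outputs of all interleavings that keep each processor's program order."""
    outputs = set()
    # A schedule names the processor that executes its next operation.
    for schedule in set(permutations([0, 0, 1, 1, 2, 2])):
        memory = {"a": 0, "b": 0, "c": 0}
        done = [0, 0, 0]                 # operations finished per processor
        printed = ["", "", ""]
        for p in schedule:
            if done[p] == 0:             # first operation: the write
                memory[WRITES[p]] = 1
            else:                        # second operation: the print
                printed[p] = "".join(str(memory[x]) for x in PRINTS[p])
            done[p] += 1
        outputs.add("".join(printed))
    return outputs


outputs = interleaved_outputs()
print(len(outputs), "of 64 outputs are possible")
print("000000" in outputs, "001011" in outputs)
```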



Sequential Consistency (Lamport79, SC)

A multiprocessor system is sequentially consistent if:

- Each processor performs its memory operations in the program order of its program. The values become visible in that order.

- The overall effect of all memory operations (of all processors) corresponds to an order which is obtained by interleaving the memory operations of all processors.

Memory operations are assumed to be atomic, i.e. the effect becomes visible before an operation is performed by any other processor.

Program order means the order with respect to the source code.

→ Total order of all memory operations of a program (stronger than coherency, because all memory operations are affected).


For the above example:

Possible outputs of the SC model: 001011, 111111

but not 011001

Coherency and Consistency are different:

Example: Three processors P1, P2, P3 with the following program

processor   P1           P2                     P3
program     (1) A = 1;   (2) while (A == 0);    (4) while (B == 0);
                         (3) B = 1;             (5) print A;

P2 waits until A gets the value 1 and sets B = 1 afterwards
P3 waits until B gets the value 1 and prints A afterwards

Output of the SC model: order (1), (2), (3), (4), (5), and A = 1

With sequentialization of the write operations to a variable, but without atomicity:

(3) can become visible on P3 before (1) → output A = 0

Clarification by showing the execution on a parallel system


Clarification:

Parallel system: cache coherency using an invalidate protocol on a directory basis

Variable assignment at the beginning: A = 0, B = 0

Cache state: the block with A and the block with B are stored with state (S) in the caches of P2 and P3.

Execution steps:
- Operations are performed for each processor in program order
- A memory operation is executed when the previous operation of the same processor has finished.

No assumptions about run times of operations in the network


Possible total execution order:

a) P1 executes (1)

→ Cache miss → directory access for A and invalidation messages to P2 and P3

b) P2 executes (2) and has already received the invalidation message

→ Read miss → a copy of the current value A = 1 is sent

→ Update in global memory

P2 executes (3) → write miss because of state (S) → directory access for B and invalidation messages for P1 and P3

c) P3 executes (4) and the invalidation message for B has arrived → read miss → the current value of B is sent and the update in global memory is performed

P3 executes (5) and has not yet received the invalidation message from P1 → A = 0 is read from the cache (old value)

→ Sequential consistency is violated:
P2 sees the order: A = 1; B = 1;
P3 sees the order: B = 1; A = 1; (because the new value of B, but the old value of A is read)

→ Sequentialization of memory accesses is not enough; atomicity must be added.
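The execution above can be replayed with a toy model in which invalidation messages are delivered explicitly. Everything here is a simplifying assumption: `memory` stands for the current value as the directory would supply it, `cache` holds the local copies of P2 and P3, and the flag `atomic_write` decides whether the write (1) counts as finished only after all invalidations have arrived.

```python
def run(atomic_write):
    memory = {"A": 0, "B": 0}
    # P2 and P3 start with both blocks cached in state (S)
    cache = {"P2": {"A": 0, "B": 0}, "P3": {"A": 0, "B": 0}}

    def read(p, var):
        if var not in cache[p]:              # read miss: fetch current value
            cache[p][var] = memory[var]
        return cache[p][var]                 # hit: possibly an old value

    def invalidate(p, var):
        cache[p].pop(var, None)

    # a) P1 executes (1): A = 1, invalidations are sent to P2 and P3
    memory["A"] = 1
    invalidate("P2", "A")                    # arrives at P2 quickly
    if atomic_write:
        invalidate("P3", "A")                # write completes everywhere first

    # b) P2 executes (2): read miss, sees A == 1 and leaves the loop
    assert read("P2", "A") == 1
    # P2 executes (3): B = 1, invalidation is sent to P3
    memory["B"] = 1
    cache["P2"]["B"] = 1
    invalidate("P3", "B")

    # c) P3 executes (4): read miss, sees B == 1 and leaves the loop
    assert read("P3", "B") == 1
    # P3 executes (5): print A
    printed = read("P3", "A")

    invalidate("P3", "A")                    # the delayed message arrives too late
    return printed


print("without atomicity: A =", run(atomic_write=False))
print("with atomicity:    A =", run(atomic_write=True))
```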




Sufficient Conditions for Sequential Consistency

1) Each processor performs memory operations in program order (no out-of-order execution)

2) After starting a memory operation, the executing processor waits until that operation has finished; in particular, the (I) marks are set

3) After starting a read operation, the executing processor waits until the read operation has finished and until the write operation whose value the read operation returns has finished on all other processors.

No assumptions are made about the cooperation of the processors, about the parameters of the network, or about the memory organization.

Example:

Condition 3) has the effect:
After reading A, P2 waits until the corresponding write operation (A = 1 by P1) has finished and performs the next memory operation (B = 1) afterwards → for A and B, P3 reads either the old or the new values, but never the new value of B followed by the old value of A.

Advantage: SC has a simple programming model
Disadvantage: atomicity can lead to inefficiency.








Weaker Consistency Models

The SC model requires the following execution orders for the read and write operations of a processor:

1) R → R: read accesses in program order
2) R → W: read and write access in program order (anti-dependence, if same memory location)
3) W → W: consecutive write accesses in program order (output dependence, if same memory location)
4) W → R: write and read operation in program order (true dependence, if same memory location)

Weaker consistency models remove some of the strict orders of SC.


Processor Consistency

Removes the requirement for the W → R order, i.e. a read operation R can be executed before a preceding write operation W has finished, if there is no data dependency.

TSO (total store ordering) model: SPARC processors from Sun
PC (processor consistency) model: Intel Pentium processors
(two terms for the same principle)

Example:

Variables A, B are initialized with 0:

Processor   P1             P2
Program     (1) A = 1;     (3) B = 1;
            (2) print B;   (4) print A;

SC model: always (1) before (2) and (3) before (4) → the output A = 0 and B = 0 is not possible

TSO model: the output A = 0 and B = 0 is possible, because the writes (1) and (3) need not be finished before the reads (2) and (4) are executed.

Note:

- Correct behavior for synchronous programs

- Hiding of write latencies is possible
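The difference between the two models in this example can be enumerated with a small abstract machine. This is a sketch under stated assumptions: each processor has three events, W (the write is issued into a write buffer), R (the read for the print) and F (the write is finished, i.e. visible in memory); SC is modelled by forcing F before R, the relaxed W → R order by dropping that constraint. The names `outputs` and the event letters are chosen for this illustration.

```python
from itertools import permutations


def outputs(model):
    """Set of (value of B printed by P1, value of A printed by P2)."""
    events = [(p, e) for p in (0, 1) for e in ("W", "R", "F")]
    results = set()
    for order in permutations(events):
        pos = {event: k for k, event in enumerate(order)}
        # program order W before R, and a write finishes after it is issued
        valid = all(pos[(p, "W")] < pos[(p, "R")] and
                    pos[(p, "W")] < pos[(p, "F")] for p in (0, 1))
        if model == "SC":
            # W -> R order: the write is finished before the read executes
            valid = valid and all(pos[(p, "F")] < pos[(p, "R")] for p in (0, 1))
        if not valid:
            continue
        memory = {"A": 0, "B": 0}
        printed = [None, None]
        for p, e in order:
            own, other = ("A", "B") if p == 0 else ("B", "A")
            if e == "F":
                memory[own] = 1              # the write becomes visible
            elif e == "R":
                printed[p] = memory[other]   # read the other variable
        results.add(tuple(printed))
    return results


print("SC: ", sorted(outputs("SC")))
print("TSO:", sorted(outputs("TSO")))
```

Only the relaxed model produces (0, 0), i.e. both processors print 0.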




Partial-Store-Ordering (PSO)

Removes the requirements for W → R and W → W

→ Write operations can become visible in an order different from program order
→ Overlapping of write operations is possible (advantages on write misses)

Example:

Variables A and flag are initialized with 0

Processor   P1               P2
Program     (1) A = 1;       (3) while (flag == 0);
            (2) flag = 1;    (4) print A;

SC model: output A = 0 is not possible
TSO model: output A = 0 is not possible
PSO model: flag = 1 can be finished before A = 1 → output A = 0 is possible
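The three cases can be checked by enumerating the order of three events: the moment A = 1 becomes visible, the moment flag = 1 becomes visible, and the print of A by P2. This is an abstract sketch, not a model of real hardware; the event names and the function `printed_values` are chosen for this illustration.

```python
from itertools import permutations

EVENTS = ["A=1 visible", "flag=1 visible", "print A"]


def printed_values(model):
    """Values of A that P2 can print under the given consistency model."""
    values = set()
    for order in permutations(EVENTS):
        pos = {event: k for k, event in enumerate(order)}
        # P2 leaves the loop (3) only after it has seen flag == 1
        if pos["print A"] < pos["flag=1 visible"]:
            continue
        # SC and TSO keep the W -> W order of P1, PSO does not
        if model in ("SC", "TSO") and pos["flag=1 visible"] < pos["A=1 visible"]:
            continue
        values.add(1 if pos["A=1 visible"] < pos["print A"] else 0)
    return values


for model in ("SC", "TSO", "PSO"):
    print(model, sorted(printed_values(model)))
```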


Weak-Ordering

Additionally removes the requirements R → R and R → W

→ There is no guarantee on any order

There are additional synchronization operations:

a) All read and write operations performed before the synchronization operation (in program order) are finished before the synchronization operation is executed

b) A synchronization operation is completed before any read or write operations following the synchronization operation are executed

