Cache Coherence Mechanisms (Research project) CSCI-5593 Prepared By Sultan Almakdi, Abdulwahab Alazeb, Mohammed Alshehri 1


Outline:

• Introduction

• Cache coherence problem

• Cache coherence definition

• Cache coherence solutions:
– Software solutions
– Hardware solutions

• Cache coherence mechanisms:
– Snoopy protocol
– Directory-based protocol

• Team’s implementation plan

Introduction

Modern systems rely on shared-memory multiprocessors to reduce execution time.

Each processor has its own private cache.

The cache is very important because it speeds up processing: a read or write that hits in the cache completes in just a few CPU cycles.

As a result, there may be multiple copies of the same data in different caches.

The Big Question: How do we keep all of those copies of the data consistent?

Importance of Cache Regarding Performance

• Cache reduces the average memory access latency
– A main memory access costs from 100 to 1000 cycles
– Cache reduces the latency to a small number of cycles

• Cache reduces the average bandwidth required to access main memory
– Fewer accesses go to the shared bus or interconnect

• Cache allows for automatic migration of data
– Data is moved closer to the processor

• Cache automatically replicates the data
– Replication is done on demand
– Processors can share data efficiently
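The latency benefit can be made concrete with the standard average memory access time relation (hit time + miss rate × miss penalty). The sketch below is only illustrative: the function name `average_access_time`, the 2-cycle hit time, the 5% miss rate, and the 200-cycle penalty are assumptions chosen inside the 100 to 1000 cycle range quoted above, not measurements.

```python
def average_access_time(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time = hit time + miss rate * miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty_cycles


# Assumed numbers: every access goes to memory (no cache) versus
# a cache with a 2-cycle hit time that misses 5% of the time.
without_cache = average_access_time(0, 1.0, 200)
with_cache = average_access_time(2, 0.05, 200)
speedup = without_cache / with_cache
```

Under these assumed numbers the average access cost drops by more than an order of magnitude, which is why keeping data in private caches is worth the coherence effort.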

But private caches can create a problem!!

The Cache Coherence Problem

Imagine what happens when different processors read and write the same memory location:

– Multiple copies of the data in different processors’ caches may end up holding different values for the same data.

– When a processor modifies its copy of the data, the change might NOT become visible to the others.

– The other processors would then hold a stale (invalid) value of the data in their caches.

Example

Time | Event                      | Cache-1 | Cache-2               | Memory
0    |                            | -       | -                     | 1
1    | Processor 1 reads A        | 1       | -                     | 1
2    | Processor 2 reads A        | 1       | 1                     | 1
3    | Processor 1 writes 0 to A  | 0       | 1                     | 0
4    | Processor 2 reads A        | 0       | 1 (reads wrong value) | 0
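The sequence in the example can be replayed with a minimal sketch of two private caches that take no coherence action. The class name `PrivateCache` is hypothetical, and write-through is assumed because the Memory column changes to 0 at time 3.

```python
class PrivateCache:
    """A private cache with write-through but NO coherence actions."""

    def __init__(self, memory):
        self.memory = memory   # shared main memory (a dict)
        self.lines = {}        # address -> locally cached value

    def read(self, addr):
        # On a miss, fetch from memory; on a hit, return the local copy.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value   # write-through; other caches are not told


memory = {"A": 1}
cache1, cache2 = PrivateCache(memory), PrivateCache(memory)

cache1.read("A")            # time 1: Processor 1 reads A
cache2.read("A")            # time 2: Processor 2 reads A
cache1.write("A", 0)        # time 3: Processor 1 writes 0 to A
stale = cache2.read("A")    # time 4: Processor 2 hits on its old copy
```

Processor 2 hits in its own cache at time 4, so it never sees the new value even though main memory already holds it.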

Cache Coherence

What does cache coherence mean?

Cache coherence is the process of ensuring that all copies of shared data are consistent.

To get correct execution, coherence must be enforced between the caches.

A memory is coherent if the following conditions are fulfilled:

1. Write propagation: After a processor writes, the written value must eventually become visible to the other processors.

2. Write serialization: If two processors write to the same location at the same time, all processors must see the two writes in the same order.

Cache Coherence

There are four primary design issues that must be considered for a coherence mechanism to get the best performance:

1. Coherence detection strategy.

2. Precision of block-sharing information.

3. Cache block size.

4. Coherence enforcement strategy.

Four primary design issues

1. Coherence detection strategy:
• How incoherent memory accesses are detected.
• Detection can occur at run-time or at compile-time.

2. Precision of block-sharing information:
• How precise the coherence detection strategy will be.
• There is a trade-off between performance and implementation cost.

Four primary design issues (cont.)

3. Cache block size:
• How the cache block size affects the performance of the memory system.

4. Coherence enforcement strategy:
• Whether to update or invalidate data to make sure that "invalid data" will not be read by any processor.

Cache Coherence Solutions

1. Software solutions: These rely on the operating system and the compiler.

2. Hardware solutions: We will focus on hardware solutions because they are more common and more closely related to our course.

Hardware Solutions

Two basic methods for cache-memory coherence:

1. Write-back: The memory is updated only when the block in the cache is being replaced.

2. Write-through: The memory is updated every time the cache is updated.
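The difference between the two policies can be sketched with one small cache model. This is an illustration under simplifying assumptions (a dictionary as main memory, a hypothetical class `WritePolicyCache`, and an explicit `evict` call standing in for block replacement), not a description of any particular hardware.

```python
class WritePolicyCache:
    """Toy cache that is either write-through or write-back."""

    def __init__(self, memory, write_through):
        self.memory = memory
        self.write_through = write_through
        self.lines = {}      # address -> cached value
        self.dirty = set()   # addresses modified but not yet in memory

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_through:
            self.memory[addr] = value   # memory updated on every write
        else:
            self.dirty.add(addr)        # memory updated later, on replacement

    def evict(self, addr):
        # Replacing a dirty block is the moment write-back updates memory.
        if addr in self.dirty:
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)
        self.lines.pop(addr, None)


mem_wt, mem_wb = {"A": 1}, {"A": 1}
wt = WritePolicyCache(mem_wt, write_through=True)
wb = WritePolicyCache(mem_wb, write_through=False)

wt.write("A", 5)
wb.write("A", 5)
memory_after_write = (mem_wt["A"], mem_wb["A"])

wb.evict("A")
memory_after_evict = mem_wb["A"]
```

With write-back, memory holds a stale value until the block is replaced, which is why the coherence protocol has to be able to find the most recent copy in another cache.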

Hardware Solutions

Two basic methods for cache-cache coherence:

1. Write-invalidate:
• Before writing, the processor invalidates all other copies of the data block so that it has exclusive access to the block.

2. Write-update:
• When a processor writes to a data block, all other copies of the block are updated as well.
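A minimal sketch of the two cache-cache methods, with each cache modeled as a plain dictionary. The function names `write_invalidate` and `write_update` and the list-of-dicts representation are assumptions made for illustration only.

```python
def write_invalidate(caches, writer, addr, value):
    """Remove every other copy, then write the single remaining copy."""
    for index, cache in enumerate(caches):
        if index != writer:
            cache.pop(addr, None)       # invalidate the copy if present
    caches[writer][addr] = value


def write_update(caches, writer, addr, value):
    """Write locally, then push the new value to every existing copy."""
    caches[writer][addr] = value
    for index, cache in enumerate(caches):
        if index != writer and addr in cache:
            cache[addr] = value         # update, but do not create new copies


invalidate_caches = [{"A": 1}, {"A": 1}, {}]
write_invalidate(invalidate_caches, writer=0, addr="A", value=9)

update_caches = [{"A": 1}, {"A": 1}, {}]
write_update(update_caches, writer=0, addr="A", value=9)
```

After the invalidating write only the writer holds the block, while after the updating write every cache that already had the block holds the new value.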


Cache Coherence Protocols

The methods explained in the last slide are most commonly used by the following mechanisms:

1. Snoop-based protocol

2. Directory-based protocol

Snoopy-based protocol

The snoopy-based coherence protocol is very popular in multi-core systems because it is simple and has low overhead.

The bus allows each processor to monitor all of the transactions to the shared memory.

A controller (the “snooper”) is used in each cache to respond to the bus and to other processors’ requests.

The snooping protocol is fast when enough bandwidth is available, and it provides low average miss latency.

[Figure: Processor 1 to Processor 4, each with its own caches and a snooper, connected by a shared bus to main memory.]

Snoopy-based protocol (cont.)

• All coherence transactions are broadcast, so they are seen by all other processors.

• When a cache snooper sees a write on the bus, it invalidates the line in its own cache if it is present.

• When a cache snooper sees a read request on the bus, it checks whether it has the most recent copy of the data and, if so, responds to the bus request.

Snoop-based protocol (cont.)

Two major methods are used by Snoop-based protocol:

1. Write-invalidate.

2. Write-update.

In the case of the write-update protocol:
– The processor writes to the shared block, then broadcasts the write on the bus.
– The other processors snoop the bus and update all of their copies of the block.
– The memory is always kept up to date.

This method is not preferred because every write requires a broadcast, which needs more bandwidth and leads to more traffic.

In the case of the write-invalidate protocol:
– There is one writer and many readers.
– To write to shared data, an invalidate is sent to all caches, which snoop the bus and invalidate any copies they hold.
– When a read miss occurs (with write-back caches), the caches are snooped to find the most recent copy.

It is used in most modern multicore systems because it generates less bus traffic.

Some Snoop Cache Types “based on the block states”

Basic Protocol: Modified, Shared and Invalid

Berkeley Protocol: Owned Exclusive, Owned Shared, Shared and Invalid

Illinois Protocol: Private Dirty, Private Clean, Shared and Invalid

MESI Protocol: Modified, Exclusive, Shared and Invalid

Snoopy-based protocol

Each block of main memory is in one of these states:
– Shared: clean in all caches and up to date in memory
– Exclusive: dirty in exactly one cache
– Un-cached: not in any cache

Each cache block can be in one of the following states:
– Modified: the only valid copy in any cache, and its value is different from the main memory copy.
– Shared: a valid copy, but other caches may also have it.
– Invalid: the block has no valid data.
– Exclusive: this copy has not been modified yet, but it is the only valid copy in any cache.

Example 1:

Processor 1 wants to read block A.

Read hit: block A is in its own cache.

Read miss: block A is not in its own cache, so the processor broadcasts a request on the bus:
– If another cache has a valid copy of the block, it gets the block from there.
– If not, it gets the block from main memory, as follows:

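[Figure: Example 1. Processor 1 wants to read block A from main memory. P1 has a read miss and places Read A on the controller bus; block A is clean in main memory and is loaded into P1's cache in the Shared (S) state.]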

Example 2:

Processor 3 wants to read block A, which is in the cache of P1.

[Figure: Example 2. P3 has a read miss and places Read A on the controller bus; block A ends up in the Shared (S) state in both P1 and P3, and the copy in main memory is clean.]

Example 3:

Processor 4 wants to write to block A, which is in the caches of P1 and P3.

[Figure: Example 3. P4 has a write miss and places Write A on the controller bus; the copies of A in P1 and P3 change from Shared (S) to Invalid (I), P4 holds A in the Modified (M) state, and block A is now marked dirty.]

Example 4:

Processor 3 has an invalid copy of block A and wants to read it, and there is a modified copy of the block in the cache of P4.

[Figure: Example 4. P3 has a read miss and places Read A on the bus; P4 writes its Modified (M) copy back to main memory, the block changes from dirty to clean, and both P3 and P4 hold A in the Shared (S) state. The copy in P1 stays Invalid (I).]

Requests from the processor:

Request    | Source | Block state      | Action
Read hit   | Proc   | Shared/Exclusive | Read data in cache
Read miss  | Proc   | Invalid          | Place read miss on bus
Read miss  | Proc   | Shared           | Conflict miss: place read miss on bus
Read miss  | Proc   | Modified         | Write back block, place read miss on bus
Write hit  | Proc   | Exclusive        | Write data in cache
Write hit  | Proc   | Shared           | Broadcast on bus to invalidate other copies
Write miss | Proc   | Invalid          | Place write miss on bus
Write miss | Proc   | Shared           | Conflict miss: place write miss on bus
Write miss | Proc   | Modified         | Write back, place write miss on bus

Requests from the bus:

Request    | Source | Block state | Action
Read miss  | Bus    | Shared      | No action; allow memory to respond
Read miss  | Bus    | Modified    | Place block on bus; change to shared
Write miss | Bus    | Shared      | Invalidate block
Write miss | Bus    | Modified    | Write back block; change to invalid
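The processor-side and bus-side actions can be combined into a small simulation. The following is a simplified sketch, assuming a three-state (Modified/Shared/Invalid) write-invalidate protocol, a single bus, and a cache that never needs to replace a block (so the conflict-miss rows do not arise). The class names `Bus` and `SnoopyCache` are hypothetical.

```python
M, S, I = "Modified", "Shared", "Invalid"


class SnoopyCache:
    def __init__(self, ident, bus):
        self.ident, self.bus = ident, bus
        self.state = {}   # address -> M/S/I (a missing entry means Invalid)
        self.data = {}

    def read(self, addr):
        if self.state.get(addr, I) == I:            # read miss
            self.bus.broadcast("read miss", addr, self.ident)
            self.data[addr] = self.bus.memory[addr]
            self.state[addr] = S
        return self.data[addr]                      # read hit

    def write(self, addr, value):
        if self.state.get(addr, I) != M:            # write miss, or hit on Shared
            self.bus.broadcast("write miss", addr, self.ident)
            self.state[addr] = M
        self.data[addr] = value

    def snoop(self, kind, addr):
        state = self.state.get(addr, I)
        if state == M:
            # Owner supplies the block by writing it back to memory.
            self.bus.memory[addr] = self.data[addr]
            self.state[addr] = S if kind == "read miss" else I
        elif state == S and kind == "write miss":
            self.state[addr] = I


class Bus:
    def __init__(self, n_caches, memory):
        self.memory = memory
        self.caches = [SnoopyCache(i, self) for i in range(n_caches)]

    def broadcast(self, kind, addr, sender):
        # Every other cache snoops the transaction.
        for cache in self.caches:
            if cache.ident != sender:
                cache.snoop(kind, addr)


bus = Bus(4, {"A": 1})
p1, p2, p3, p4 = bus.caches
p1.read("A")              # Example 1: read miss served by memory
p3.read("A")              # Example 2: second sharer
p4.write("A", 7)          # Example 3: P1 and P3 are invalidated
value_seen = p3.read("A")  # Example 4: P4 writes back, both become Shared
```

Replaying the four examples leaves P3 and P4 in the Shared state, P1 Invalid, and memory holding the value written by P4.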

Important Observations

A processor that wants to write to a block it holds in the shared state has to upgrade its copy from shared to exclusive first.

With the write-back method, main memory is updated when the processor that holds the modified copy changes the block's state to shared.

Directory-based protocol

Each processor (or cluster of processors) has its own memory.
• The directory is also distributed along with the corresponding memory.

Each processor has:
– Fast access to its local memory
– Slower access to “remote” memory, which is located at other processors

The physical address is enough to determine the location of the memory.

Processing nodes:
• The nodes are connected by a scalable interconnect, so messages are routed from sender to receiver instead of being broadcast.
• Snooping is no longer possible, so a record of each block's sharing state is kept in the directory in order to track it.

Directory-based protocol (cont.)

Typically three nodes are involved:

– Local node: where a request originates.

– Home node: where the memory location of an address resides.

– Remote node: holds a copy of the cache block, either exclusive or shared.

Cache states:

– Shared:
• At least one processor has the data cached
• Memory is up to date
• Any processor is able to read the block

– Exclusive:
• Only one processor (the owner) has the data cached
• Memory may be stale
• Only that processor can write to it

– Invalid (un-cached):
• No processor has the data cached

A bit vector is used to keep track of which processors have the data in the shared state, or which single processor holds it in the exclusive state.
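A minimal sketch of such a bit vector, with one presence bit per node. The class name `SharerVector` and the 0-based node numbering are assumptions for illustration (the slides number the nodes from 1).

```python
class SharerVector:
    """Hypothetical directory bit vector: bit n is set if node n has a copy."""

    def __init__(self, n_nodes):
        self.n_nodes = n_nodes
        self.bits = 0

    def _check(self, node):
        if not 0 <= node < self.n_nodes:
            raise ValueError("node index out of range")

    def add(self, node):
        self._check(node)
        self.bits |= 1 << node          # set the presence bit

    def remove(self, node):
        self._check(node)
        self.bits &= ~(1 << node)       # clear the presence bit

    def sharers(self):
        return [n for n in range(self.n_nodes) if (self.bits >> n) & 1]


vector = SharerVector(4)
vector.add(0)
vector.add(1)
shared_by = vector.sharers()    # two bits set: block is shared

vector.remove(0)
vector.remove(1)
vector.add(3)
owned_by = vector.sharers()     # one bit set: candidate exclusive owner
```

The vector only records who holds a copy. Whether a single set bit means shared or exclusive still has to be stored as a separate state field in the directory entry.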

[Figure: four nodes, each consisting of a processor and its caches, memory, a directory, and I/O, connected by an interconnection network.]

Example 1:

Assume processor 1 wants to read block A, whose address is “2.5 GB”:

1. From the address, processor 1 recognizes that block A is in the memory of node 3.

2. Processor 1 sends a request to node 3.

3. The directory of node 3 checks the state of the block, sends a copy to processor 1, records the block as being in the shared state, and keeps tracking it.
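Step 1 relies on the physical address alone identifying the home node. A minimal sketch follows, assuming each of the four nodes owns one contiguous 1 GB slice of the address space. The slides do not state this layout, but it is consistent with address 2.5 GB living on node 3. The function name `home_node` is hypothetical.

```python
GIB = 1 << 30   # bytes in one gibibyte


def home_node(address, bytes_per_node=GIB, n_nodes=4):
    """Return the 1-based node whose memory holds `address`.

    Assumes node k owns addresses [(k - 1) * bytes_per_node, k * bytes_per_node).
    """
    node = address // bytes_per_node + 1
    if not 1 <= node <= n_nodes:
        raise ValueError("address is outside the installed memory")
    return node


block_a_home = home_node(int(2.5 * GIB))
```

Real machines may interleave addresses across nodes at a finer granularity; the contiguous split here is only the simplest mapping that matches the example.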

[Figure: Example 1 on the four-node system. P1 has a read miss on block A (address 2.5 GB) and sends Read A to node 3. The directory of node 3 checks the state of the block, sends a copy of A to P1, puts the block in the shared state, and keeps tracking it (directory entry "S: P1"; P1 holds A in state S).]

Example 2:

Assume processor 2 now wants to read block A, whose address is “2.5 GB”:

1. From the address, processor 2 recognizes that the block is in the memory of node 3. The block is currently in the shared state with processor 1.

2. Processor 2 sends a request to node 3.

3. The directory of node 3 checks the state of the block, finds it in the shared state, and keeps tracking it.

[Figure: Example 2 on the four-node system. P2 has a read miss on block A. The directory of node 3 checks the state of the block, then puts the copy for P2 in the shared state and keeps tracking it (directory entry changes from "S: P1" to "S: P1, P2"; P1 and P2 hold A in state S).]

Example 3:

Assume processor 4 now wants to WRITE to block A, whose address is “2.5 GB”:

1. From the address, processor 4 recognizes that the block is in the memory of node 3. The block is currently in the shared state with processors 1 and 2.

2. Processor 4 sends a request to node 3.

3. The directory of node 3 checks the state of the block and finds it in the shared state. It then sends node-to-node requests to P1 and P2 to change the state of A from shared to invalid, and waits for their ACKs, since no bus is used here.

4. The directory is updated by deleting the entries for the copies held by P1 and P2 and recording the copy of P4 in the exclusive state. The directory keeps tracking the block.

[Figure: Example 3 on the four-node system. P4 has a write miss on block A (address 2.5 GB) and sends Write A to node 3. The directory of node 3 sends a node-to-node message to P1 and another to P2 to change their copies from Shared (S) to Invalid (I), waits for both ACKs, and changes its entry from "S: P1, P2" to "E: P4". P4 holds A in the Exclusive (E) state.]

Example 4:

Assume processor 1 now wants to READ block A, BUT its copy is invalid. The address of the block is “2.5 GB”:

1. From the address, processor 1 recognizes that the block is in the memory of node 3, so it sends a request to node 3. The block is currently in the exclusive state with processor 4.

2. The directory of node 3 checks the state of the block and finds that it is in the exclusive state with P4.

3. Node 3 therefore forwards the request to node 4, which changes the block state to shared and, using the write-back technique, updates the memory of node 3 with the latest copy of block A.

4. After that, either node 3 (home node) or node 4 (remote node) sends the copy of block A to node 1 (local node).

5. Finally, the directory of node 3 updates its table and keeps tracking the block.

[Figure: Example 4 on the four-node system. P1 has a read miss on block A (address 2.5 GB) and sends the request to node 3, whose directory entry is "E: P4". Node 3 forwards "Read A for P1" to node 4, which changes its copy to Shared (S) and writes the updated block back to the memory of node 3. Either node 3 (home node) or node 4 (remote node) sends the copy of A to node 1 (local node). The directory entry becomes "S: P1, P4"; the copy in P2 stays Invalid (I).]

Directory Actions

If the block is in the un-cached state:
– Read miss: send data, make block shared
– Write miss: send data, make block exclusive

If the block is in the shared state:
– Read miss: send data, add node to sharers list
– Write miss: send data, invalidate sharers, make exclusive

If the block is in the exclusive state:
– Read miss: ask owner for data, write back to memory, send data, make shared, add node to sharers list
– Data write-back: write to memory, make un-cached
– Write miss: ask owner for data, write to memory, send data, update identity of new owner, remain exclusive
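The action list above maps directly onto a small state machine for one block. This is a sketch under simplifying assumptions: a single block, messages recorded in a log instead of being sent over an interconnect, and a hypothetical class name `DirectoryEntry`.

```python
class DirectoryEntry:
    """Directory state for one memory block at its home node."""

    def __init__(self):
        self.state = "Uncached"
        self.sharers = set()   # in the Exclusive state this holds only the owner
        self.log = []          # messages the directory would send

    def read_miss(self, node):
        if self.state == "Exclusive":
            owner = next(iter(self.sharers))
            self.log.append(f"fetch from owner {owner}, write back to memory")
        self.log.append(f"send data to {node}")
        self.sharers.add(node)
        self.state = "Shared"

    def write_miss(self, node):
        if self.state == "Shared":
            for sharer in sorted(self.sharers - {node}):
                self.log.append(f"invalidate {sharer}")
        elif self.state == "Exclusive":
            owner = next(iter(self.sharers))
            self.log.append(f"fetch from owner {owner}, write to memory")
        self.log.append(f"send data to {node}")
        self.sharers = {node}           # the requester becomes the only owner
        self.state = "Exclusive"

    def write_back(self, node):
        if self.state == "Exclusive" and node in self.sharers:
            self.log.append(f"write data from {node} to memory")
            self.sharers.clear()
            self.state = "Uncached"


entry = DirectoryEntry()
entry.read_miss(1)     # Example 1: S: P1
entry.read_miss(2)     # Example 2: S: P1, P2
entry.write_miss(4)    # Example 3: E: P4
after_write = (entry.state, set(entry.sharers))
entry.read_miss(1)     # Example 4: S: P1, P4
after_read = (entry.state, set(entry.sharers))
```

Replaying the four directory examples reproduces the directory entries shown in the figures: shared by P1 and P2, then exclusive to P4, then shared by P1 and P4.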

Snoopy-Based Advantages and Disadvantages

Advantages:
– The average miss latency is low, especially for cache-to-cache misses.
– With a small number of processors, snooping is fast.

Disadvantages:
– The cache coherence overhead and the speed of the shared bus limit the bandwidth available for broadcasting messages to all processors.
– It does not scale to large systems, since each request is broadcast to all processors.
– Buses have scalability limitations:
o Physical (number of devices that can be attached)
o Performance (contention on a shared resource: the bus)

Directory-Based Advantages and Disadvantages:

Advantages:
– They scale much better than snoopy protocols (no broadcast required).
– They can exploit arbitrary point-to-point interconnects.

Disadvantages:
– The directory access and the extra interconnect traversal are on the critical path of cache-to-cache misses.
– The latency is longer than with a snoopy protocol, since there are 3 hops (request, forward, response).

Observation study:

• Snoopy-based protocols outperform directory-based protocols when bandwidth is high.

• As the number of processors increases, directory-based protocols outperform snoopy-based protocols [5].


Our Implementation Plan

We will implement these two schemes:

• Snoopy-based protocol

• Directory-based protocol

We will also simulate the following:

• Cores

• Local caches

• Memory access patterns

Our Implementation Plan (cont.)

In this implementation, the following parameters will be varied in order to understand how changing them affects the performance of each scheme:

• Number of processors

• Cache/block size

• Applied coherence protocol

The collected results will include the number of hits and misses for each cache level.

In this project, we will classify each miss as a compulsory miss, a capacity miss, or a conflict miss.
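One common way to make this classification is to run a fully associative LRU cache of the same capacity next to the simulated cache. The sketch below assumes a direct-mapped cache and a trace of block numbers. The function name `classify_misses` and this particular method are illustrative assumptions, not the team's stated design.

```python
from collections import OrderedDict


def classify_misses(block_trace, n_lines):
    """Count hits and miss types for a direct-mapped cache with n_lines lines."""
    direct = {}             # line index -> block currently stored there
    fully = OrderedDict()   # fully associative LRU cache of the same capacity
    seen = set()            # blocks referenced at least once
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for block in block_trace:
        index = block % n_lines
        hit_direct = direct.get(index) == block
        hit_fully = block in fully

        # Keep the reference fully associative cache up to date.
        if hit_fully:
            fully.move_to_end(block)
        else:
            if len(fully) >= n_lines:
                fully.popitem(last=False)   # evict the least recently used block
            fully[block] = True

        if hit_direct:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1       # first reference ever
        elif not hit_fully:
            counts["capacity"] += 1         # would miss even with full associativity
        else:
            counts["conflict"] += 1         # caused only by the mapping

        direct[index] = block
        seen.add(block)
    return counts


conflict_example = classify_misses([0, 4, 0, 4], n_lines=4)
capacity_example = classify_misses([0, 1, 2, 3, 4, 0], n_lines=4)
```

Blocks 0 and 4 map to the same line in the first trace, so their repeated misses are counted as conflict misses. In the second trace the working set is larger than the cache, so the repeated reference to block 0 is a capacity miss.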


Thanks a lot…

References:

1. J. Hennessy, D. Patterson. Computer Architecture: A Quantitative Approach (5th ed.). Morgan Kaufmann, 2011.

2. Hashemi, B., "Simulation and Evaluation Snoopy Cache Coherence Protocols with Update Strategy in Shared Memory Multiprocessor Systems," Parallel and Distributed Processing with Applications Workshops (ISPAW), 2011 Ninth IEEE International Symposium on, pp. 256-259, 26-28 May 2011.

3. Ahmed, R.E.; Dhodhi, M.K., "Directory-based cache coherence protocol for power-aware chip-multiprocessors," Electrical and Computer Engineering (CCECE), 2011 24th Canadian Conference, pp. 001036-001039, 8-11 May 2011.

4. Emil Gustafsson and Bruno Nilbert, "Cache coherence in parallel multiprocessors," Department of Computer Science, Uppsala University, Uppsala, 24th February 1997.

5. Milo M. K. Martin, Daniel J. Sorin, Mark D. Hill, and David A.: "Bandwidth Adaptive Snooping," 8th Annual International Symposium on High-Performance Computer Architecture (HPCA-8). (2002) 2-6
