Cache Coherence Mechanisms (Research Project)
CSCI-5593
Prepared by Sultan Almakdi, Abdulwahab Alazeb, Mohammed Alshehri
2
Outline:
• Introduction
• Cache coherence problem
• Cache coherence definition
• Cache coherence solutions:
– Software solutions
– Hardware solutions
• Cache coherence mechanisms:
– Snoopy protocol
– Directory-based protocol
• Team's implementation plan
3
Introduction
• Modern systems rely on shared-memory multiprocessors to reduce execution time.
• Each processor has its own private cache.
• The cache matters for performance because a read or write that hits in it completes in just a few CPU cycles.
• As a result, several copies of the same data may exist in different caches.
• The big question: how do we keep all of those copies consistent?
4
Importance of Cache Regarding Performance
• Cache reduces the average latency:
– A main memory access costs from 100 to 1000 cycles.
– A cache hit costs only a small number of cycles.
• Cache reduces the average bandwidth required from main memory:
– Fewer accesses go to the shared bus or interconnect.
• Cache allows automatic migration of data:
– Data is moved closer to the processor.
• Cache automatically replicates data:
– Replication is done on demand.
– Processors can share data efficiently.
But private caches can create a problem!
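The latency benefit can be made concrete with the standard average memory access time relation, AMAT = hit time + miss rate × miss penalty. The sketch below is only a worked example: the hit time and miss rate are hypothetical, and the 200-cycle penalty is simply one value picked from the 100 to 1000 cycle range quoted above.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical numbers: 2-cycle hit, 5% miss rate, 200-cycle memory access.
with_cache = amat(hit_time=2, miss_rate=0.05, miss_penalty=200)
without_cache = 200  # assumption: every access goes all the way to main memory

print(f"with cache: {with_cache:.1f} cycles, without: {without_cache} cycles")
```

With these assumed numbers the average access is more than an order of magnitude cheaper than going to memory every time; the exact ratio depends entirely on the miss rate.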
5
The Cache Coherence Problem
What happens when different processors read and write the same memory location?
– Multiple copies of the data in different processors' caches may hold different values.
– When a processor modifies its copy, the change might NOT become visible to the others.
– The other processors are then left with a stale (invalid) value in their caches.
6
Example
Time  Event                      Cache-1  Cache-2    Memory
0     -                          -        -          1
1     Processor 1 reads A        1        -          1
2     Processor 2 reads A        1        1          1
3     Processor 1 writes 0 to A  0        1          0
4     Processor 2 reads A        0        1 (stale)  0
At time 4, processor 2 reads the wrong value (1) from its own cache.
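The timeline above can be reproduced with a minimal sketch of two private caches that have no coherence mechanism at all. The names `read` and `write_through` are illustrative, and write-through is assumed because the table shows memory updated at time 3.

```python
memory = {"A": 1}
cache1, cache2 = {}, {}

def read(cache, addr):
    if addr not in cache:            # miss: fetch the value from memory
        cache[addr] = memory[addr]
    return cache[addr]               # hit: use the cached copy, no coherence check

def write_through(cache, addr, value):
    cache[addr] = value
    memory[addr] = value             # memory is updated, other caches are NOT told

read(cache1, "A")                    # time 1
read(cache2, "A")                    # time 2
write_through(cache1, "A", 0)        # time 3
stale = read(cache2, "A")            # time 4: hits the old copy in cache 2

print("processor 2 read", stale, "but memory holds", memory["A"])
```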
7
Cache Coherence
What Does Cache Coherence Mean?
• Cache coherence is the process of ensuring that all copies of shared data are consistent.
• For execution to be correct, coherence must be enforced between the caches.
8
A memory system is coherent if the following conditions hold:
1. Write propagation: a value written by any processor must eventually become visible to all other processors.
2. Write serialization: all processors must see writes to the same location in the same order, even when two processors write at the same time.
9
Cache Coherence
Four primary design issues must be considered for a coherence mechanism to perform well:
1. Coherence detection strategy.
2. Precision of block-sharing information.
3. Cache block size.
4. Coherence enforcement strategy.
10
Four primary design issues
1. Coherence detection strategy:
• How incoherent memory accesses are detected.
• Detection can happen at run time or at compile time.
2. Precision of block-sharing information:
• How precisely the mechanism tracks which caches share a block.
• There is a trade-off between performance and implementation cost.
11
Four primary design issues (cont.)
3. Cache block size:
• How the cache block size affects the performance of the memory system.
4. Coherence enforcement strategy:
• Whether to update or invalidate copies so that no processor reads stale data.
12
Cache Coherence Solutions
1. Software solutions: these rely on the operating system and the compiler.
2. Hardware solutions: we focus on these because they are more common and more closely related to our course.
13
Hardware Solutions
Two basic methods for cache-memory coherence:
1. Write-back: memory is updated only when the modified block is replaced in the cache.
2. Write-through: memory is updated every time the cache is updated.
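The difference between the two policies shows up in how often memory is written. The following is a minimal sketch, assuming a hypothetical one-block cache (`OneLineCache`) that only models writes; it is meant to illustrate the policies, not any particular hardware.

```python
class OneLineCache:
    """A one-block cache, just big enough to show when memory gets written."""

    def __init__(self, memory, write_back):
        self.memory, self.write_back = memory, write_back
        self.addr = self.value = None
        self.dirty = False
        self.memory_writes = 0

    def _evict(self):
        # Write-back: a dirty block reaches memory only when it is replaced.
        if self.write_back and self.dirty:
            self.memory[self.addr] = self.value
            self.memory_writes += 1
        self.dirty = False

    def write(self, addr, value):
        if addr != self.addr:
            self._evict()
            self.addr = addr
        self.value = value
        if self.write_back:
            self.dirty = True
        else:
            # Write-through: every cache write also goes to memory.
            self.memory[addr] = value
            self.memory_writes += 1

def run(write_back):
    cache = OneLineCache({"A": 0, "B": 0}, write_back)
    for v in (1, 2, 3):
        cache.write("A", v)
    cache.write("B", 9)              # replaces block A
    return cache.memory_writes, cache.memory["A"]

wt = run(write_back=False)
wb = run(write_back=True)
print("write-through:", wt, "write-back:", wb)
```

Both runs leave the same final value of A in memory, but the write-back run touches memory only when block A is replaced.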
14
Hardware Solutions
Two basic methods for cache-cache coherence:
1. Write-invalidate:
• Before writing, the processor invalidates all other copies of the block so that it has exclusive access to it.
2. Write-update:
• When a processor writes to a block, all other copies of the block are updated as well.
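A minimal sketch of the two policies, assuming each cache is modelled as a plain dictionary and that the hypothetical helper `write` stands in for the coherence hardware:

```python
def write(caches, writer, addr, value, policy):
    """Apply one write under the 'invalidate' or 'update' policy."""
    caches[writer][addr] = value
    for i, cache in enumerate(caches):
        if i == writer or addr not in cache:
            continue
        if policy == "invalidate":
            del cache[addr]          # other copies are dropped
        else:
            cache[addr] = value      # other copies are refreshed in place

inv = [{"A": 1}, {"A": 1}, {}]
upd = [{"A": 1}, {"A": 1}, {}]
write(inv, 0, "A", 5, "invalidate")
write(upd, 0, "A", 5, "update")

print(inv)
print(upd)
```

Under invalidation the second cache must miss and refetch on its next access; under update it already holds the new value, at the cost of sending the data on every write.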
15
Cache Coherence Protocols
The methods explained in the last slide are most commonly used by the following mechanisms:
1. Snoop-based protocol
2. Directory-based protocol
16
Snoopy-based protocol
• Snoopy-based coherence is very popular in multi-core systems because it is simple and has low overhead.
• A shared bus lets each processor monitor all transactions to the shared memory.
• Each cache has a controller, the "snooper", which responds to bus traffic and to requests from other processors.
• Snooping is fast when enough bus bandwidth is available, and it gives a low average miss latency.
17
[Figure: four processors, each with its own cache and snooper, connected by a shared bus to main memory.]
18
Snoopy-based protocol (cont.)
• All coherence transactions are broadcast, so every processor sees them.
• When a snooper sees a write on the bus, it invalidates the line in its own cache if the line is present.
• When a snooper sees a read request on the bus, it checks whether it holds the most recent copy of the data and, if so, responds to the request.
19
Snoop-based protocol(cont..)
Two major methods are used by Snoop-based protocol:
1. Write-invalidate.
2. Write-update.
20
Snoop-based protocol (cont.)
Write-update protocol:
• The processor writes to its copy of the shared block.
• The write is broadcast on the bus.
• The other caches snoop the bus and update their copies of the block.
• Memory is always kept up to date.
This method is not preferred because every write needs a broadcast, which requires more bandwidth and generates more traffic.
21
Snoop-based protocol (cont.)
Write-invalidate protocol:
• A block has one writer and many readers.
• To write to shared data, the processor sends an invalidate on the bus; all caches snoop it and invalidate their copies.
• On a read miss with write-back caches, the caches are snooped to find the most recent copy.
It is used in most modern multicore systems because it generates less bus traffic.
22
Some snoopy cache protocols, classified by their block states:
Protocol   Block states
Basic      Modified, Shared, Invalid
Berkeley   Owned Exclusive, Owned Shared, Shared, Invalid
Illinois   Private Dirty, Private Clean, Shared, Invalid
MESI       Modified, Exclusive, Shared, Invalid
23
Snoopy-based protocol
Each block of main memory is in one of these states:
• Shared: clean in all caches and up to date in memory.
• Exclusive: dirty in exactly one cache.
• Un-cached: not in any cache.
Each cache block is in one of these states:
• Modified: the only valid copy in any cache; its value differs from the copy in main memory.
• Shared: a valid copy that other caches may also hold.
• Invalid: the block holds no valid data.
• Exclusive: the only valid copy in any cache, not yet modified.
24
Example 1:
Processor 1 wants to read block A.
• Read hit: block A is in its own cache.
• Read miss: block A is not in its own cache, so it broadcasts a request:
– If another cache has a valid copy of the block, it gets the block from there.
– If not, it gets the block from main memory, as follows:
25
[Figure: Example 1. Processor 1 has a read miss on block A and places Read A on the bus. Main memory supplies the block, which P1 loads in the shared (S) state; the copy in memory is clean.]
26
Example 2:
Processor 3 wants to read block A, which is in the cache of P1.
27
[Figure: Example 2. Processor 3 has a read miss on block A and places Read A on the bus. P1 already holds A in the shared (S) state; P3 loads A in the S state as well, and the copy in memory is clean.]
28
Example 3:
Processor 4 wants to write to block A, which is in the caches of P1 and P3.
29
[Figure: Example 3. Processor 4 has a write miss on block A and places Write A on the bus. The copies in P1 and P3 change from shared (S) to invalid (I), P4 holds A in the modified (M) state, and the copy in memory is now dirty.]
30
Example 4:
Processor 3 has an invalid copy of block A and wants to read it, while the cache of P4 holds a modified copy.
31
[Figure: Example 4. Processor 3 has a read miss on block A and places Read A on the bus. P4 writes its modified copy back to main memory and changes it from M to S; P3 loads A in the S state; the copy in P1 stays invalid and the copy in memory is clean again.]
32
Requests from the processor:
Request     Source  Block state       Action
Read hit    Proc    Shared/Exclusive  Read data in cache
Read miss   Proc    Invalid           Place read miss on bus
Read miss   Proc    Shared            Conflict miss: place read miss on bus
Read miss   Proc    Modified          Write back block, place read miss on bus
Write hit   Proc    Exclusive         Write data in cache
Write hit   Proc    Shared            Broadcast on bus to invalidate other copies
Write miss  Proc    Invalid           Place write miss on bus
Write miss  Proc    Shared            Conflict miss: place write miss on bus
Write miss  Proc    Modified          Write back block, place write miss on bus
33
Requests from the bus:
Request     Source  Block state  Action
Read miss   Bus     Shared       No action; allow memory to respond
Read miss   Bus     Modified     Place block on bus; change to shared
Write miss  Bus     Shared       Invalidate block
Write miss  Bus     Modified     Write back block; change to invalid
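The processor-side and bus-side tables can be combined into a small simulation. This is a minimal sketch under stated assumptions, not the team's implementation: it tracks a single block, uses only the M, S and I states, always supplies read data from memory after any write-back, and the class name `SnoopyBus` is made up for illustration. Caches 0 to 3 stand for P1 to P4, and the four calls replay Examples 1 to 4.

```python
class SnoopyBus:
    """Write-invalidate, write-back MSI sketch for a single block."""

    def __init__(self, n_caches, memory_value=0):
        self.state = ["I"] * n_caches    # per-cache state: M, S or I
        self.value = [None] * n_caches
        self.memory = memory_value

    def _snoop(self, requester, is_write):
        """Every other cache reacts to a miss placed on the bus."""
        for c in range(len(self.state)):
            if c == requester:
                continue
            if self.state[c] == "M":
                # The owner writes the block back, then downgrades.
                self.memory = self.value[c]
                self.state[c] = "I" if is_write else "S"
            elif self.state[c] == "S" and is_write:
                self.state[c] = "I"      # invalidate a shared copy

    def read(self, p):
        if self.state[p] == "I":         # read miss: place it on the bus
            self._snoop(p, is_write=False)
            self.value[p] = self.memory
            self.state[p] = "S"
        return self.value[p]

    def write(self, p, v):
        if self.state[p] != "M":         # write miss, or upgrade from shared
            self._snoop(p, is_write=True)
            self.state[p] = "M"
        self.value[p] = v

bus = SnoopyBus(4, memory_value=1)
bus.read(0)                  # Example 1: P1 read miss, served by memory
bus.read(2)                  # Example 2: P3 read miss, P1 already shares A
bus.write(3, 7)              # Example 3: P4 write miss invalidates P1 and P3
after_write = list(bus.state)
seen_by_p3 = bus.read(2)     # Example 4: P4 writes back, P3 reads the new value
print(after_write, bus.state, seen_by_p3, bus.memory)
```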
34
Important Observations
• A processor that wants to write to a block it holds in the shared state must first upgrade the block from shared to exclusive.
• With write-back, main memory is updated when the processor holding the modified copy changes the block to the shared state.
35
Directory-based protocol
• Each processor (or cluster of processors) has its own memory.
• The directory is distributed along with the corresponding memory.
• Each processor has:
– Fast access to its local memory
– Slower access to "remote" memory, which is located at other processors
• The physical address is enough to determine which node's memory holds a location.
• Processing nodes:
– The nodes are connected by a scalable interconnect, so messages are routed from sender to receiver instead of being broadcast.
– Caches can no longer snoop, so the sharing state of each block is recorded in the directory.
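Finding the home node from a physical address can be sketched as a simple division. The layout here is an assumption: four nodes, each owning one contiguous 1 GiB slice of the address space, chosen so that the "2.5 GB" address used in the directory examples lands on node 3. Real systems may interleave memory differently.

```python
GIB = 1 << 30
MEMORY_PER_NODE = 1 * GIB    # assumption: one contiguous 1 GiB slice per node

def home_node(physical_address, memory_per_node=MEMORY_PER_NODE):
    """Return the 1-based number of the node whose memory holds the address."""
    return physical_address // memory_per_node + 1

node = home_node(int(2.5 * GIB))
print("home node for address 2.5 GiB:", node)
```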
36
Directory-based protocol (cont.)
Typically three nodes are involved:
– Local node: where the request originates.
– Home node: where the memory location of the address resides.
– Remote node: holds a copy of the cache block, either exclusive or shared.
37
Directory-based protocol (cont.)
Cache states:
– Shared:
• At least one processor has the data cached.
• Memory is up to date.
• Any processor can read the block.
– Exclusive:
• Only one processor (the owner) has the data cached.
• Memory is stale.
• Only that processor can write to it.
– Invalid (un-cached):
• No processor has the data cached.
A bit vector records which processors hold the block in the shared state, or which single processor holds it exclusively.
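A directory entry with a bit-vector sharer list could look like the sketch below. The class name, the state letters and the choice of a Python integer as the bit vector are all illustrative assumptions; nodes are numbered from 0 here.

```python
class DirectoryEntry:
    """Per-block directory record: a state plus one presence bit per node."""

    def __init__(self):
        self.state = "U"      # U = un-cached, S = shared, E = exclusive
        self.sharers = 0      # bit i is set when node i holds a copy

    def add_sharer(self, node):
        self.sharers |= 1 << node

    def remove_sharer(self, node):
        self.sharers &= ~(1 << node)

    def holders(self):
        """List the nodes whose presence bit is set."""
        return [i for i in range(self.sharers.bit_length())
                if (self.sharers >> i) & 1]

entry = DirectoryEntry()
entry.state = "S"
entry.add_sharer(0)
entry.add_sharer(1)
both = entry.holders()
entry.remove_sharer(0)
print(both, entry.holders(), bin(entry.sharers))
```

One bit per node keeps the entry compact and exact, but the vector grows linearly with the number of nodes.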
38
Directory-based protocol (cont.)
[Figure: four nodes, each with a processor and its caches, local memory, a directory, and I/O, connected by an interconnection network.]
39
Directory-based protocol (cont.)
Example 1: processor 1 wants to read block A, whose address is "2.5 GB":
1. From the address, processor 1 recognizes that block A is in the memory of node 3.
2. Processor 1 sends a request to node 3.
3. The directory of node 3 checks the state of the block, sends a copy of A to P1, records the block as shared, and keeps tracking it.
40
[Figure: P1 has a read miss on block A (address 2.5 GB) and sends Read A to node 3. The directory of node 3 checks the state of the block, sends a copy of A to P1, and records the block as shared with P1 as a sharer (S: P1).]
41
Directory-based protocol (cont.)
Example 2: processor 2 now wants to read block A, whose address is "2.5 GB":
1. From the address, processor 2 recognizes that the block is in the memory of node 3; the block is currently in the shared state with processor 1.
2. Processor 2 sends a request to node 3.
3. The directory of node 3 checks the state of the block, finds it in the shared state, sends a copy to P2, adds P2 to the sharers, and keeps tracking the block.
42
[Figure: P2 has a read miss on block A (address 2.5 GB) and sends Read A to node 3. The directory checks the state of the block, sends a copy of A to P2, and keeps the block in the shared state; the directory entry changes from S: P1 to S: P1, P2.]
43
Example 3:
Processor 4 now wants to WRITE to block A, whose address is "2.5 GB":
1. From the address, processor 4 recognizes that the block is in the memory of node 3; the block is currently in the shared state with processors 1 and 2.
2. Processor 4 sends a request to node 3.
44
3. The directory of node 3 checks the state of the block and finds it in the shared state. Because there is no bus, it sends node-to-node requests to P1 and P2 to change the state of A from shared to invalid, and waits for their ACKs.
4. The directory is updated: the entries for the copies in P1 and P2 are deleted, the copy in P4 is recorded in the exclusive state, and the directory keeps tracking the block.
45
[Figure: P4 has a write miss on block A (address 2.5 GB) and sends Write A to node 3. The directory sends node-to-node messages to P1 and P2 to change their copies from S to I and waits for both ACKs. The directory entry changes from S: P1, P2 to E: P4, and P4 holds A in the exclusive state.]
46
Example 4:
Processor 1 now wants to READ block A, but its copy is invalid. The address of the block is "2.5 GB":
1. From the address, processor 1 recognizes that the block is in the memory of node 3, so it sends a request to node 3. The block is currently in the exclusive state with processor 4.
2. The directory of node 3 checks the state of the block and finds that it is in the exclusive state with P4.
47
3. Node 3 forwards the request to node 4, which changes its copy of the block to shared and writes the updated copy of block A back to the memory of node 3.
4. Then either node 3 (the home node) or node 4 (the remote node) sends the copy of block A to node 1 (the local node).
5. Finally, the directory of node 3 updates its entry and keeps tracking the block.
48
[Figure: P1 has a read miss on block A (address 2.5 GB) and sends Read A to node 3, whose directory entry is E: P4. Node 3 forwards the request (Read A for P1) to node 4, which changes its copy from M to S and writes the updated block back to the memory of node 3. Either node 3 or node 4 sends the copy of A to P1, which holds it in the S state. The directory entry becomes S: P1, P4.]
49
Directory Actions
If the block is in the un-cached state:
• Read miss: send data, make block shared
• Write miss: send data, make block exclusive
If the block is in the shared state:
• Read miss: send data, add node to sharers list
• Write miss: send data, invalidate sharers, make exclusive
If the block is in the exclusive state:
• Read miss: ask owner for data, write back to memory, send data, make shared, add node to sharers list
• Data write-back: write to memory, make un-cached
• Write miss: ask owner for data, write to memory, send data, update identity of new owner, remain exclusive
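The directory actions listed above can be sketched as a small state machine for one block. This is a hedged illustration: the class name `Directory`, the message strings in `log`, and the use of a Python set for the sharers list are assumptions, and ACK handling is left out. The four calls at the end replay the directory Examples 1 to 4 with nodes numbered 1 to 4.

```python
class Directory:
    """Home-node directory for one block: state U (un-cached), S or E."""

    def __init__(self):
        self.state, self.sharers, self.log = "U", set(), []

    def read_miss(self, node):
        if self.state == "E":
            (owner,) = self.sharers          # exactly one owner in state E
            self.log.append(f"fetch from {owner}, write back to memory")
        self.log.append(f"send data to {node}")
        self.sharers.add(node)               # a former owner keeps a shared copy
        self.state = "S"

    def write_miss(self, node):
        if self.state == "S":
            for s in sorted(self.sharers - {node}):
                self.log.append(f"invalidate {s}")
        elif self.state == "E":
            (owner,) = self.sharers
            self.log.append(f"fetch from {owner}, write to memory")
        self.log.append(f"send data to {node}")
        self.sharers = {node}                # the requester is the new owner
        self.state = "E"

    def write_back(self, node):
        self.log.append(f"write back from {node} to memory")
        self.sharers, self.state = set(), "U"

d = Directory()
d.read_miss(1)                    # Example 1: S, sharers {1}
d.read_miss(2)                    # Example 2: S, sharers {1, 2}
d.write_miss(4)                   # Example 3: invalidate 1 and 2, E with owner 4
owner_state = (d.state, set(d.sharers))
d.read_miss(1)                    # Example 4: fetch from 4, S with sharers {1, 4}
print(owner_state, d.state, sorted(d.sharers))
```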
50
Snoopy-Based Advantages and Disadvantages
Advantages:
• The average miss latency is low, especially for cache-to-cache misses.
• With a small number of processors, snooping is fast.
Disadvantages:
• Coherence overhead and the speed of the shared bus limit the bandwidth available for broadcasting messages to all processors.
• It does not scale to large systems, because every request is broadcast to all processors.
• Buses have scalability limits:
o Physical (number of devices that can be attached)
o Performance (contention on a shared resource: the bus)
51
Directory-Based Advantages and Disadvantages
Advantages:
• They scale much better than snoopy protocols (no broadcast required).
• They can exploit arbitrary point-to-point interconnects.
Disadvantages:
• The directory access and the extra interconnect traversal are on the critical path of cache-to-cache misses.
• Latency is higher than with a snoopy protocol because there are three hops (request, forward, response).
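The scaling argument can be illustrated with a deliberately simplified message count for a single read miss. This is a toy model, not measured data: it assumes a snoopy broadcast is observed by every other cache controller, that a directory read miss needs either two messages (request, response) or three (request, forward, response), and it ignores invalidations and ACKs. Both function names are hypothetical.

```python
def snoopy_messages(n_processors):
    """One broadcast is observed by every other cache controller."""
    return n_processors - 1

def directory_messages(owned_remotely):
    """Request to home, an optional forward to the owner, then the response."""
    return 3 if owned_remotely else 2

for n in (4, 16, 64):
    print(n, "processors:",
          snoopy_messages(n), "snooped messages vs",
          directory_messages(True), "directory messages")
```

In this model the snoopy cost grows with the number of processors while the directory cost stays constant, at the price of an extra hop on the critical path.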
52
Observation study:
• Snoopy-based protocols outperform directory-based protocols when bandwidth is plentiful.
• As the number of processors increases, directory-based protocols outperform snoopy-based protocols [5].
53
54
55
Our Implementation Plan
We will implement these two schemes:
• Snoopy-based protocol
• Directory-based protocol
Also we will simulate the following:
• Cores
• Local caches
• Memory Access Patterns
56
Our Implementation Plan (cont.)
We will vary the following parameters to see how each affects the performance of each scheme:
• Number of processors
• Cache/block size
• Applied coherence protocol
The collected results will include the number of hits and misses for each cache level.
We will also classify each miss as a compulsory miss, a capacity miss, or a conflict miss.
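One common way to classify misses is sketched below, assuming a direct-mapped cache under study and a fully associative LRU cache of the same size as the reference: a miss is compulsory on the first reference to a block, capacity if the fully associative cache would also miss, and conflict otherwise. The function name and the trace of block numbers are made up for illustration; the team's simulator may define the categories differently.

```python
from collections import OrderedDict

def classify_misses(block_trace, n_lines):
    """Classify each access to a direct-mapped cache with n_lines lines."""
    seen = set()                       # blocks referenced at least once
    direct = [None] * n_lines          # the cache being studied
    full_lru = OrderedDict()           # fully associative LRU cache, same size
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}
    for block in block_trace:
        in_full = block in full_lru
        if in_full:
            full_lru.move_to_end(block)
        else:
            if len(full_lru) == n_lines:
                full_lru.popitem(last=False)   # evict the least recently used
            full_lru[block] = True
        index = block % n_lines
        if direct[index] == block:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1
        elif not in_full:
            counts["capacity"] += 1
        else:
            counts["conflict"] += 1
        direct[index] = block
        seen.add(block)
    return counts

counts = classify_misses([0, 4, 0, 1, 2, 3, 5, 4], n_lines=4)
print(counts)
```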
57
Thanks a lot…
58
References:
1. J. Hennessy, D. Patterson. Computer Architecture: A Quantitative Approach (5th ed.). Morgan Kaufmann, 2011.
2. Hashemi, B., "Simulation and Evaluation Snoopy Cache Coherence Protocols with Update Strategy in Shared Memory Multiprocessor Systems," Parallel and Distributed Processing with Applications Workshops (ISPAW), 2011 Ninth IEEE International Symposium on, pp. 256-259, 26-28 May 2011.
3. Ahmed, R.E.; Dhodhi, M.K., "Directory-based cache coherence protocol for power-aware chip-multiprocessors," Electrical and Computer Engineering (CCECE), 2011 24th Canadian Conference, pp. 001036-001039, 8-11 May 2011.
4. Emil Gustafsson and Bruno Nilbert, "Cache coherence in parallel multiprocessors," Department of Computer Science, Uppsala University, 24 February 1997.
5. Milo M. K. Martin, Daniel J. Sorin, Mark D. Hill, and David A., "Bandwidth Adaptive Snooping," 8th Annual International Symposium on High-Performance Computer Architecture (HPCA-8), 2002, pp. 2-6.