CSCI 5593 – Advanced Computer Architecture – spring 2015
Supervised By: Professor Gita Alaghband
Team 1 Members:
Shahab Helmi
Anh Nguyen
Phuc Nguyen
Harish Kodali
What is Cache Coherence?
Design Space of Cache Coherence Protocols
  States
  Events
  Transactions
  Invalidate vs. Update
Protocol Types
  Snooping protocols
  Directory protocols
  Hybrid protocols
Snooping Protocols
  MSI Protocol
  MESI Protocol
  MOSI Protocol
  Advanced Snooping protocols
A Case Study: Sun Starfire E10000
Snooping Simulation
  How to Run?
  Parameters
  Inputs
  Metrics
  Code
Results & Analysis
  Experiment 1: Number of invalidate messages in MSI & MESI
  Experiment 2: Number of write-backs in MSI & MOSI
  Experiment 3: Effect of cache size on the number of write-backs
  Experiment 4: Write-back reduction vs. number of cores
Directory Protocols
  The baseline protocols
  Adding Owned State (MOSI protocols)
Implementation plan
Implementation
How to run the code
Results & Analysis
  Adding Owned State (MOSI protocols)
Source code
References
What is Cache Coherence?

Definition: "In computer science, cache coherence is the consistency of shared resource data that ends up stored in multiple local caches." (Wikipedia)
Example: assume that the blue table shows the memory, the numbers on the left are the addresses of the memory blocks, and A, B, and so on are the values of the blocks. The green tables are 3 local caches. Each cache contains copies of some memory blocks. As can be seen in figure 1.a, all copies of the same memory block have an equal value (consistent shared data).
In figure 1.b, the value of the second block of cache 3 has been updated, but this change has not been reflected in the memory and the other caches, which raises the cache coherence problem.
Figure 1.a – Consistent shared data
Figure 1.b – Inconsistent shared data
Cache coherence protocols are used to solve the cache coherence problem and keep the data consistent among all caches and memory.
Cache coherence protocols maintain coherence by implementing the following invariant:
Single Writer, Multiple Readers (SWMR) invariant: for every memory location at any given time, either only one core can write to it (and maybe read it) OR one or more cores can read from it.
Cache coherence protocols use finite state machines to implement the SWMR invariant. Each coherence controller implements a finite state machine per block (in all sources that we found, cache coherence protocols maintain coherence at the level of blocks, not bits).
Figure 2 – Cache coherence structure
As can be seen in figure 2, each cache controller has two sides:
1. Network side: interfaces to the rest of the system through the interconnection network. For example, if a cache miss occurs, the cache controller issues a coherence message to the other caches or to memory (according to the protocol type) and receives the data; it can also send data to other cores or to memory.
2. Core side: interfaces to the core. It receives load and store requests from the core and sends/receives data to/from the core.
The memory controller is similar to the cache controller, except that it does not have the core side.
Design Space of Cache Coherence Protocols

Each protocol has 4 main elements.
States
The state of each cache block consists of the properties below:
Validity: a valid block has the most up-to-date value for the block. A valid block may be read, but may be written only if it is also exclusive.
Dirtiness: a cache block is dirty if its value is the most up-to-date value and this value differs from the value in the memory.
Exclusivity: a cache block is exclusive if it is the only privately cached copy of the block in the system (no other cache holds a copy).
Ownership: a cache controller (or memory controller) is the owner of a block if it is responsible for responding to coherence requests for that block. An owned block cannot be evicted without passing the ownership to another controller. In most protocols, there is exactly one owner for each block.
Most protocols use a subset of the classic five-state MOESI model (pronounced "MO-zee"); these are called the stable states. Each has a different combination of the properties described above.
Modified: valid, exclusive, owned, and potentially dirty. May be read or written. It is the only valid copy of the block, and the cache must respond to requests for the block. The memory copy of the block is potentially stale.
Shared: valid, not exclusive, not dirty, and not owned. The cache has a read-only copy of the block. Other caches might have valid, read-only copies of the block.
Invalid: the block is invalid. The cache either does not contain the block or has a stale version of it. The block may not be read or written.
Owned: valid, owned, and potentially dirty, but not exclusive. The cache has a read-only copy of the block and must respond to requests for the block. The memory copy is potentially stale.
Exclusive: valid, exclusive, and not dirty. The cache has a read-only copy of the block. The memory copy of the block is up-to-date.
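The stable-state property combinations above can be summarized in a small sketch (the MoesiState enum and its field names are our own illustration, not part of any particular protocol implementation; "dirty" here means "potentially dirty"):

```java
// Each MOESI stable state as a combination of the four properties.
enum MoesiState {
    M(true, true, true, true),     // Modified: valid, dirty, exclusive, owned
    O(true, true, false, true),    // Owned: valid, dirty, not exclusive, owned
    E(true, false, true, false),   // Exclusive: valid, clean, exclusive
    S(true, false, false, false),  // Shared: valid, clean, read-only copy
    I(false, false, false, false); // Invalid: no usable copy

    final boolean valid, dirty, exclusive, owner;

    MoesiState(boolean valid, boolean dirty, boolean exclusive, boolean owner) {
        this.valid = valid;
        this.dirty = dirty;
        this.exclusive = exclusive;
        this.owner = owner;
    }

    // A block may be written only if it is valid and exclusive (SWMR).
    boolean writable() { return valid && exclusive; }
}
```

Under this encoding only M and E satisfy the write condition (E via the silent upgrade to M discussed later), which is exactly the SWMR invariant: a single writer requires exclusivity.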
Transient states occur during the transition from one stable state to another. XY^z means that the block is in transition from stable state X to stable state Y, and the transition will not complete until an event of type z occurs. As an example, IM^D denotes that a block was in the I state and will move to the M state when data (D) is received.
There are 2 general approaches to naming the states of blocks in the memory. The choice of naming does NOT affect functionality or performance.
Cache-centric: the state of a block in the memory is an aggregation of the states of that block in the caches. For example, if a block is in state I in all caches, the memory state for the block is I. If one or more copies are in S, the block is in S at the memory. If the block is in state M in one cache, it is in M at the memory.
Memory-centric: the state of the block corresponds to the memory controller's permission for the block. For example, if a block is in I in all caches, its memory state will be O, because the memory behaves as its owner. If the copies are all in S, the memory state will also be O. If the block is in M or O in one cache, its memory state will be I, since the memory has an invalid (stale) copy.
To maintain the states of blocks in the caches, the most common way is to add a few extra bits at the end of each block. For example, in MOESI we need 3 bits to encode the state.
To maintain the states of blocks in the memory, we can use the same approach. Alternatively, we can use logic gates. For example, as depicted in figure 3, a NOR gate can be used to determine the block's state in the memory.
Figure 3 – Using logical gates to determine states of the memory blocks
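The 3-bit figure for MOESI follows from needing ceil(log2(5)) bits for five stable states (with spare encodings left over for an "empty" marker or transient states, if those are stored too). A minimal sketch of that calculation (the stateBits helper is hypothetical):

```java
class StateBits {
    // Minimum number of per-block state bits to encode numStates distinct states.
    static int stateBits(int numStates) {
        int bits = 0;
        while ((1 << bits) < numStates) bits++;
        return bits;
    }
}
```

stateBits(5) yields 3 for MOESI, while plain three-state MSI needs only stateBits(3) = 2 bits per block.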
Events
Events are requests from cores to their cache controllers. Table 1 shows the common events and their meanings:
Table 1 – Common coherence events
Transactions
Transactions are all initiated by cache controllers that are responding to requests from their
associated cores. Most protocols have a similar set of transactions, because the basic goals of the
coherence controllers are similar.
Table 2 – Common coherence transactions
Invalidate vs. Update
The other major design decision in a coherence protocol is what to do when a core writes to a block. There are two options:
Invalidate protocols: when a core wishes to write to a block, it initiates a coherence transaction to invalidate the copies in all other caches. Thus, if other cores later want to read the block, they need to issue a new request to obtain a fresh copy of it.
Update protocols: when a core wishes to write to a block, it initiates a coherence transaction to update the copies in all other caches to reflect the new value it wrote to the block.
Tradeoffs:
Update protocols reduce the read latency, since other caches keep up-to-date copies.
Update protocols use more bandwidth, since their messages are bigger (they carry data as well).
Protocol Types

Cache coherence protocols can be classified into 3 groups:
Snooping protocols
Idea: all coherence controllers observe (snoop) coherence requests in the same order. By
requiring that all requests to a given block arrive in order, a snooping system enables the distributed
coherence controllers to correctly update the finite state machines that collectively represent a
cache block’s state.
Traditional snooping protocols broadcast requests to all coherence controllers, including the
controller that initiated the request. The coherence requests typically travel on an ordered broadcast
network, such as a bus.
Directory protocols
Idea: a coherence request for a block is issued by a cache controller and unicast to the memory controller that is the home for that block. Each memory controller maintains a directory that holds state about each block in the memory.
Hybrid protocols
Hybrid protocols are a combination of snooping and directory protocols. Snooping protocols are simpler to implement than directory protocols, but because of the large number of messages that snooping protocols exchange, they cannot be scaled up to a high number of cores.
Hybrid protocols are designed to take advantage of the good properties of each type.
Snooping Protocols

The baseline system used in our project can be seen in figure 4. Each core has its own cache and cache controller. The cache controllers are able to communicate with each other and with memory through the interconnection network. We assume that the interconnection network is in charge of ordering the coherence messages.
Figure 4 – the baseline system
MSI Protocol
MSI is the simplest snooping protocol and has just three states: Modified, Shared, and Invalid. In the initialization phase, all cache blocks are in state I and all memory blocks are in state M (since the memory is the owner).
We assume that MSI implements two atomicity properties:
Atomic requests: a coherence request is ordered in the same cycle that it is issued.
Atomic transactions: coherence transactions are atomic, in that a subsequent request for the same block may not appear on the bus until after the first transaction completes (i.e., until after the response has appeared on the bus).
Figure 5 shows the state diagram of the MSI cache controller, and table 3 its corresponding table.
Figure 5 – MSI cache controller state diagram
Table 3 – MSI cache controller table
For example, if the cache contains a copy of a memory block in the Invalid (I) state and the core sends a load request for that block, the cache controller issues a GetS request, broadcasts it to all other controllers, updates the cache block's state to IS^D, and waits (stalls) until the requested data is received. When the data is received, the cache controller updates the block's state to S.
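This I-state load path can be sketched as a fragment of the per-block state machine (the class, method, and state names below are our own, not taken from the simulator; IS_D stands for the transient state IS^D):

```java
class MsiCacheController {
    enum State { I, IS_D, S, M }

    State state = State.I;
    boolean busGetSIssued = false;

    // The core issues a load while the block is in state I: broadcast a
    // GetS on the ordered bus and stall in IS^D until the data arrives.
    void coreLoad() {
        if (state == State.I) {
            busGetSIssued = true;  // GetS is ordered in the same cycle (atomic requests)
            state = State.IS_D;    // wait (stall) for the data response
        }
    }

    // The data response for our outstanding GetS appears on the bus.
    void dataResponse() {
        if (state == State.IS_D) state = State.S; // the load can now complete
    }
}
```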
The cells marked with (A) are forbidden because of the atomic-transactions property.
Figure 6 and table 4 show the state diagram and the corresponding table of the MSI memory controller, respectively.
Figure 6 – MSI memory controller state diagram
Table 4 – MSI memory controller table
Advantages:
Small table and few possible states.
Easy to understand and implement.
Multiple copies of the same block can be available, because of the Shared state.
Disadvantages:
Many impossible states due to the atomic-transactions property, and many stalls:
Lower throughput
Higher latency
Unnecessary broadcast of invalidate messages: when a core wants to write to a block, it must get the block in state M and send an invalidate message to all other cores, whether or not it holds the only copy of that block.
Tradeoffs: downgrade from M to S or to I? We would need to predict whether the block is going to be used again or not.
MESI Protocol
MESI implements the atomic-transactions property but not the atomic-requests property.
The Exclusive state is used in almost all commercial coherence protocols, because it optimizes a common case: a core first reads a block and then subsequently writes it.
In MSI, a core needs to issue a GetS message to get read permission (in case of a cache miss) and then has to issue a GetM message to get write permission.
In MESI, a core can get the block in the Exclusive state, in which no other cache holds a copy of it. Thus, the core does not need to issue a GetM message before writing.
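The saving can be sketched by counting bus transactions for this read-then-write case, under the assumption that no other cache holds the block (a simplified illustration; names are ours):

```java
class MesiUpgrade {
    enum State { I, S, E, M }

    // Bus transactions for a read followed by a write to the same block,
    // assuming no other cache holds a copy at the time of the read.
    static int busTransactions(boolean mesi) {
        int txns = 1;                              // GetS for the read miss
        State afterRead = mesi ? State.E : State.S;
        if (afterRead != State.E) txns++;          // MSI: a GetM is still needed
        return txns;                               // MESI: silent E -> M upgrade
    }
}
```

MESI completes the pair with a single transaction, while MSI needs two.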
Figure 7 and table 5 show the MESI cache controller state diagram and its corresponding table, respectively.
Figure 7 – MESI cache controller state diagram
Table 5 – MESI cache controller table
As mentioned, we assumed that MESI does not implement the atomic-requests property. Thus, IS^AD means that the cache controller has issued the GetS request and is moving the cache block from Invalid to Shared, but the request has not been ordered yet. As soon as the request is ordered, the state is updated to IS^D and the controller waits to receive the requested data. Finally, when the data is received, the state is updated to Shared.
Figure 8 and table 6 show the state diagram and the corresponding table of the MESI memory controller, respectively.
Figure 8 – MESI memory controller state diagram
Table 6 – MESI memory controller table
Advantages:
Silent transition from the Exclusive state to the Modified state; no unnecessary invalidate messages are issued.
Read and then write while issuing only one request.
Fewer messages.
Less traffic on the bus, and lower bandwidth usage.
Disadvantages:
Extra hardware is needed to implement the exclusive state.
MOSI Protocol
When a cache has a block in state M or E and receives a GetS from another core, under the MSI protocol or the MESI protocol the cache must:
change the block's state from M or E to S
send the data to both the requestor and the memory controller
This raises the question of how a snooping protocol can minimize, or eliminate:
The extra data message to update the memory when a cache in the M (or E) state receives a GetS request?
The potentially unnecessary write to the memory?
The key difference: when a cache with a block in state M receives a GetS from another core, in a MOSI protocol the cache changes the block's state to Owned (instead of S) and retains ownership of the block (instead of transferring ownership to the memory). The O state enables the cache to avoid updating the memory. Thus, when a controller requests a block, if one of the caches has the block in state O, it will send it to the requestor, and there is no need to load it from the memory.
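The difference in GetS handling can be sketched as follows (an illustrative fragment with our own names, not the simulator's code):

```java
class MosiDowngrade {
    enum State { M, O, S, I }

    static class Result {
        final State newState;
        final boolean writeBackToMemory;
        Result(State s, boolean wb) { newState = s; writeBackToMemory = wb; }
    }

    // A cache holding the block in M observes a GetS from another core.
    static Result onGetS(boolean mosi) {
        return mosi
            ? new Result(State.O, false)  // MOSI: retain ownership, no memory update
            : new Result(State.S, true);  // MSI/MESI: memory becomes owner again
    }
}
```

In the MOSI case the data message goes only to the requestor; the write to memory is deferred until the owner eventually evicts the block.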
Figure 9 and table 7 show the MOSI cache controller state diagram and its corresponding table, respectively.
Figure 9 – MOSI cache controller state diagram
Table 7 – MOSI cache controller table
Differences from the MESI protocol are shown in red.
Figure 10 and table 8 show the state diagram and the corresponding table of the MOSI memory controller, respectively.
Figure 10 – MOSI memory controller state diagram
Table 8 – MOSI memory controller table
Advanced Snooping protocols
Pipelined (non-atomic) bus: we can send multiple coherence requests without waiting for the responses. However, the responses must be received in order. For example, as illustrated in figure 11, the data response for request 1 must be received before the data response for request 2.
Figure 11 – Pipelined (non-atomic) bus
Split-transaction (non-atomic) bus: provides responses in an order different from the request order. Thus, this method increases the system throughput, since controllers do not need to wait for each other; see figure 12.
Figure 12 – Split Transaction (non-atomic) bus
For more info about the MSI protocol with split-transaction bus, please refer to our slides.
A Case Study: Sun Starfire E10000

Uses MOESI.
Non-atomic requests and transactions.
Supports up to 64 processors.
Wired snooping buses consume lots of energy; thus, they do not scale up to a large number of cores. To solve this problem, the E10000 uses point-to-point links instead.
Uses a separate bus for sending out-of-order data response messages.
In this system, all controllers unicast their coherence requests to the root node, and the root node orders the requests and broadcasts them to all other controllers. A high-level architecture of this system is shown in figure 13.
Figure 13 – Sun Starfire E10000 message passing
Snooping Simulation

We implemented 3 different cache coherence simulations: one for snooping protocols in C# and 2 for directory protocols, in Java and C++, according to our skills in each programming language. We tried our best to follow the same algorithms, using the same parameters, inputs, and so on (especially between the C# and Java versions). We did not have enough time to match everything, but the main elements of the simulations are the same.
How to Run?
If you would like to view and edit the code, you will need Microsoft Visual Studio 2013. Follow the path below:
Team 1 Final Report (Root Folder) → Code → Snooping Simulator → Snooping CSCI5593
If you would like to execute the code, you will need .NET Framework 4.5. Follow the path below:
Team 1 Final Report (Root Folder) → Code → Snooping Simulator → Snooping.exe
Parameters
A screenshot of the application (input entry) is shown in figure 14.
Hardware Parameters:
Number of cores
Cache and memory latency (cycles)
Memory and cache size (number of blocks): we assumed that memory and cache blocks have the same block size. Thus, we only need the number of blocks to calculate the memory and cache sizes.
Input Parameters:
Input size (number of load/store requests for each core): for each core we generate a series of load/store instructions. For example, if the input size is 100, then 100 load/store instructions will be generated for each core.
Store percentage (distribution of load vs. store requests): for example, if it is set to 40, then 40% of the requests will be stores and 60% of them will be loads.
Larger input size / smaller memory → higher probability that cores need a block at the same time.
Larger store percentage → more conflicts → more stalls.
Figure 14 – Parameter entry
Inputs
We tested our simulator using 12 different inputs to analyze the behavior of each protocol under different conditions. We got the idea of having different inputs from SPLASH-2, a well-known and widely used benchmark suite. To ease the testing task, you can simply choose one of the inputs from the Test Name drop-down list and the values will be loaded automatically into the fields. The specifications of each input are shown in tables 9a and 9b.
Table 9a – Input L1 to H3
As you can observe, the values are the same in all fields and differ only in input size and store percentage. This helps us see the effect of a single property on each protocol. L, M, and H stand for Light, Medium, and Heavy, respectively, which refer to the size of the input.
Table 9b – Input MC1 to MW3
Metrics
We keep track of the parameters below in our application. Since we did not have enough time to analyze all of them one by one, we focused on the number of write-backs and the number of invalidate messages, which are used in our analysis.
Per core:
Write-backs
Memory reads
Invalidate messages
Coherence messages (broadcasting between cores)
Memory messages (coherence messages sent to the memory)
Data responses
Stalls
Cache hits
Cache misses
Replacements (evictions): when the cache is full
Per Protocol:
Write-backs
Invalidate messages
Coherence messages (broadcasting between cores)
Memory messages (coherence messages sent to the memory)
All messages
Memory references (read/write from memory)
Stalls
Cache hits
Cache misses
Code
In this section, some high level information about the code is presented.
Input Folder:
This folder contains 4 classes:
InputElement.cs: this class defines the data structure of our input elements. Each input element has 3 fields: _Command, _Core, and _BlockID.
Example: core #2 needs to load the block that is originally located in the 25th block of memory (copies of it could be contained in caches!):
public string _Command = "Load";
public int _Core = 2;
public int _BlockID = 25;
Generator.cs
This class generates the input (input elements) according to the input size and store
percentage parameters.
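A minimal sketch of what such a generator might look like, translated to Java for illustration (the class name, parameter names, and output format are our assumptions, not the actual Generator.cs code):

```java
import java.util.Random;

class InputGenerator {
    // For the given core, emit inputSize load/store requests to random
    // memory blocks, where storePercentage percent of them are stores.
    static String[] generate(int core, int inputSize, int storePercentage,
                             int memoryBlocks, long seed) {
        Random rnd = new Random(seed);
        String[] requests = new String[inputSize];
        for (int i = 0; i < inputSize; i++) {
            String cmd = rnd.nextInt(100) < storePercentage ? "Store" : "Load";
            int blockId = rnd.nextInt(memoryBlocks);
            requests[i] = cmd + " " + blockId + " " + core; // command, block, core
        }
        return requests;
    }
}
```

With a store percentage of 40, roughly 40% of the generated lines start with "Store", matching the parameter semantics described above.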
InputTest.cs
Generates an input using the Generator.cs class and outputs the values. A sample generated by this class:
Load 3850 2
Load 5207 6
Load 7230 4
Store 4374 3
Store 5998 5
Load 4247 3
Store 7729 1
Load 1040 0
Store 862 2
Store 4738 4
Load 2152 7
Load 6976 1
Store 8759 6
Store 8347 3
Load 7171 0
..
Tests.cs
Includes 12 predefined configurations, as mentioned in the input section, that we used to test
our simulator.
Cores Folder:
Core.cs
We use this class for keeping track of our metrics for each core, such as the number of write-
backs, invalidate messages and so on.
CacheBlock.cs
Each cache block has 3 fields:
MBlockID: the address of the cache block.
State: shows the state of the block in the cache: M, S, E, O, I, or X for empty.
Dirty: indicates if the block is dirty or not.
Protocols Folder:
MSI.cs, MESI.cs, and MOSI.cs contain the implementations of the protocols. Some common methods of these classes are:
Load: loads a block into the cache.
Evicts: if the cache is full, this method chooses a victim block and evicts it.
UpdateState: updates the state of a block, for example M → S.
OWNE(BlockID = 1, Core = 2): returns true if cache 2 has a copy of the memory block with id = 1 in the E state.
Two other classes are just some data structures that are used in the code.
Statistics Folder:
Contains two classes Sumarry.cs and ProtocolStatistics.cs.
These classes calculate the number of messages, cache hits and so on for each protocol.
For example:
Cache Hits = cache hits of cache 1 + cache hits of cache 2 + …
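A sketch of this per-protocol aggregation (illustrative only; the class and method names are ours, not those of the C# statistics classes):

```java
class ProtocolStats {
    // A protocol-level metric is simply the sum of the per-core values.
    static long totalCacheHits(long[] perCoreHits) {
        long sum = 0;
        for (long hits : perCoreHits) sum += hits;
        return sum;
    }
}
```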
Results & Analysis

As we found out, the output results are highly dependent on the input, so it is possible that you see different results using the same input configurations. Thus, we executed each test around 10 times and computed the average values. However, for the H (heavy) inputs the runtime is very long (even hours!), so we ran them only 1-2 times.
Experiment 1: Number of invalidate messages in MSI & MESI
Motivation: as mentioned in previous sections, one of the main design goals of the MESI protocol was to reduce the number of invalidate messages. The reason for this reduction is that when a cache holds a block in state E, it does not need to broadcast invalidate messages when the block's state is upgraded to M. As the results of this experiment in figure 15 show, in all cases (except L1) MESI broadcasts fewer invalidate messages.
We first thought the L1 results were caused by a bug in our code, but when we checked the MESI state table, we noticed that it broadcasts an invalidate message when a block downgrades from E to I or S (on receiving a GetS request from another core). Thus, there are some invalidate messages even when there is no (write) conflict in the input.
Figure 15 – Number of Invalidates in MSI vs. MESI
Experiment 2: Number of write-backs in MSI & MOSI
Motivation: as mentioned in previous sections, the design goal of the MOSI protocol is to reduce the number of write-backs to memory compared to MSI. The reason is that when a controller would otherwise downgrade a block from M to S or I due to a GetS request for that block from another core, in MOSI the cache instead moves to O and remains the owner of the block, so there is no need to write the block back to memory. As figure 16 shows, the number of write-backs in MOSI was always less than in MSI. However, the difference was not significant.
Figure 16 – Number of write-backs in MSI vs. MOSI
Experiment 3: Effect of Cache Size on the Number of Write-Backs
Motivation: as we studied in this course, increasing the size of the cache reduces the number of evictions and write-backs, since the cache can fit more memory blocks and improves locality as well. We tested this using three different inputs, multiplying the size of the cache by 10 each time. In MC1, MC2, and MC3 each cache has 10, 100, and 1000 blocks, respectively. As can be seen in figure 17, the number of write-backs decreases as the cache size increases.
Figure 17 – Write-back reduction vs cache size
Experiment 4: Write-back Reduction vs. Number of Cores
Motivation: as mentioned in previous sections, the design goal of the MOSI protocol is to reduce the number of write-backs to memory compared to MSI. We expected that as we increase the number of cores, MOSI would become more and more effective at reducing the number of write-backs compared to MSI.
To see the effect of this, we normalized the values; contrary to our expectation, the ratio of the number of write-backs stayed almost the same as we increased the number of cores (as you can observe in figure 18). The exact reason for this is not clear to us. It could be a bug in our code, or our assumption could be wrong.
Figure 18 – Write-back reduction vs # of cores
Directory Protocol
The baseline protocols:
The key role of the directory is to maintain the state and data of every block in every cache. To do that, the directory needs to listen to all the requests made by the cache controllers, process them, and reply to the cache controllers to ask for action. For example, when a cache controller wants to issue a coherence request, it sends a GetS message directly to the directory using a unicast (1-to-1) method; the directory then looks up the state of the block to determine what actions to take next. If the directory indicates that the requested block is owned by a cache at core A, it forwards the request to that cache, which responds to the requestor. Note that all communication is done using the unicast (1-to-1) method. The infrastructure of directory-based cache coherence is shown in Figure 14: there are multiple cores with corresponding cache controllers. All of the cache controllers connect to the interconnection network and send their requests to the directory component located at the memory.
Figure 14. The outline framework of directory-based cache coherence protocol.
We now describe the baseline system to give some basic information about how directory coherence works with the three traditional cache states: MSI. A block's state changes from one state to another; the algorithms for changing state are as follows:
From I to S: the cache controller sends a GetS request message to the directory and changes the block's state from I to IS^D. Upon receiving the request from the requestor, the directory sends back a data message and changes its own state for the block to S. When the data arrives at the requestor, the cache controller changes the block's state to S, and the transaction has completed successfully. Secondly, if the directory is not the owner of the block that holds the data, the directory forwards the request to the owner and changes the block's state to the transient state S^D. The owner responds to this Fwd-GetS by sending the data to the requestor and changing its block's state to S. The previous owner now needs to send the data to the directory as well, since ownership returns to the directory. When the data arrives at the directory, the directory copies it to memory, changes the block's state to S, and considers the transaction complete. Figure 15 illustrates the above explanation.
Figure 15. Possible transitions from state I to state S
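The directory's side of this GetS handling can be sketched as follows, covering both cases above (all names are our own illustration, not taken from our Java/C++ directory simulators):

```java
class Directory {
    enum State { S, M, S_D } // S_D is the transient state S^D

    State state;
    int owner; // owning core id, or -1 if the directory itself is the owner

    Directory(State state, int owner) {
        this.state = state;
        this.owner = owner;
    }

    // Handle a GetS from `requestor`: either reply with data directly, or
    // forward the request to the current owner and wait in S^D for the
    // data to come back to the directory.
    String onGetS(int requestor) {
        if (owner == -1) {
            state = State.S;       // directory is owner: respond with data
            return "data";
        }
        state = State.S_D;         // wait for the owner's data to reach memory
        return "fwd:" + owner;     // unicast Fwd-GetS to the owning cache
    }
}
```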
From I or S to M:
The cache controller sends a GetM request to the directory and changes the block's state to IM^AD. If the directory itself owns the requested block, it sends the data back to the requestor immediately. However, in the case where the directory does not own the block that the requestor requests, it forwards the request to the node that owns the data, and the data is then forwarded from the owner to the requestor. The figure below illustrates the two different cases of changing state from I or S to M.
Figure 16. Possible transitions from states I and S to state M
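The directory's decision between the two GetM cases above can be sketched as follows; this is illustrative code with names of our own choosing, not the project's implementation.

```cpp
#include <cassert>
#include <string>

// Directory state for a block: in I and S the directory itself is the
// owner; in M some cache owns the block.
enum class DirState { I, S, M };

// Directory-side handling of a GetM request: reply with data when the
// directory owns the block, otherwise forward to the owning cache.
std::string directory_on_getm(DirState s) {
    switch (s) {
        case DirState::I:
        case DirState::S:
            return "send data to requestor";
        case DirState::M:
            return "forward Fwd-GetM to owner cache";
    }
    return "";
}
```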
From M to I.
The cache controller first sends a PutM request, including the data of the block it
wants to evict, and changes the block's state to MIA. When the directory receives this
PutM, it updates memory and responds with a Put-Ack message. Once the
requestor receives the Put-Ack, the block's state changes to I. If the cache controller
receives a forwarded coherence request (Fwd-GetS or Fwd-GetM) between sending the
PutM and receiving the Put-Ack, it must respond to it and change the block's state to
SIA or IIA, respectively. The figure below illustrates how the messages and states change.
Figure 17. Possible transitions from state M to state I
From S to I.
To replace a block in state S, the cache controller sends a PutS message and
waits for the Put-Ack message from the directory. Once the requestor receives the
Put-Ack message from the directory, it changes the block's state to I.
Figure 18. Possible transitions from state S to state I
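The two eviction cases above (PutM with data for a dirty block, PutS without data for a shared block, both waiting for a Put-Ack) can be sketched together. The state and function names below are ours, not the project's.

```cpp
#include <cassert>

enum class State { M, S, MIA, SIA, I };

struct Evict {
    State next;         // transient state entered after sending the Put
    bool carries_data;  // PutM carries the block's data, PutS does not
};

// Start an eviction: M blocks send PutM (dirty data must reach memory),
// S blocks send PutS.
Evict start_eviction(State s) {
    if (s == State::M) return {State::MIA, true};
    return {State::SIA, false};
}

// Receiving the Put-Ack completes the transaction: the block becomes I.
State on_put_ack(State s) {
    if (s == State::MIA || s == State::SIA) return State::I;
    return s;
}
```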
Adding Owned State (MOSI protocols)
1 - Overview
In this section, we present the conceptual idea of the MOSI directory-based protocol together with
its implementation in detail. As in the snooping protocols, a block in the added Owned (O) state
is valid, read-only, dirty (it must eventually update memory), and owned (the cache
must respond to coherence requests for the block). Compared with MSI in a directory-based
protocol, adding the Owned state creates two main changes:
1. More coherence requests are satisfied by caches (in state O) than by the LLC/memory.
2. There are more 3-hop transactions, with the data supplied directly by the owner cache.
The key difference here compared with the MSI baseline protocol is the transaction in which a
requestor of a block in state I or S sends a GetM to the directory while the block is in state O in
the owner cache and in state S in one or more sharer caches. In this case, the directory
forwards the GetM to the owner and appends the AckCount. The owner receives the Fwd-GetM
and responds to the requestor with the data and the AckCount. Moreover, this protocol has a PutO
transaction that is nearly identical to the PutM transaction. It contains data for the same reason
that the PutM transaction contains data, i.e., because both M and O are dirty states. The figures
below show the additional states and messages introduced by the O state, and Table 9 shows the
key differences between MSI and MOSI; the states and actions added by the O state are
highlighted in yellow.
Figure 19. Possible transitions from state I to state S
Table 9. Possible transitions and states of the MOSI protocol. The yellow cells indicate states
that are available only in MOSI, not in the MSI protocol
2 - Distributed Directories. One approach to improving the performance of a directory
protocol is to prevent the bottleneck that arises when all requests from cache controllers go to
a single directory. We use this approach in our implementation.
Figure 20. The outline of our implementation of cache coherence protocols.
3 - Non-Stalling Directory Protocols:
We recall that one limitation of directory protocols is that stalls happen
frequently. For example, when a cache controller has a block in state IMA and receives a Fwd-
GetS, it processes the request and changes the block's state to IMAS. This state indicates that
after the cache controller’s GetM transaction completes (i.e., when the last Inv-Ack arrives), the
cache controller will change the block state to S. The cache controller must also send the block to
the requestor of the GetS and to the directory, which is now the owner. Therefore, by not stalling
on the Fwd-GetS, the cache controller can improve performance by continuing to process other
forwarded requests behind that Fwd-GetS in its incoming queue.
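The non-stalling behavior above can be sketched as two small transition functions; this is a minimal illustration under the state names used in the text, not the project's code.

```cpp
#include <cassert>

enum class State { IMA, IMAS, M, S };

// A Fwd-GetS arriving in IMA is not stalled: it is remembered via the
// transient state IMAS, so the controller can keep draining its queue.
State on_fwd_gets(State s) {
    if (s == State::IMA) return State::IMAS;
    return s;
}

// When the last Inv-Ack completes the GetM transaction, a block in
// IMAS immediately downgrades to S (and the controller would also send
// the data to the GetS requestor and to the directory, the new owner).
State on_last_inv_ack(State s) {
    if (s == State::IMAS) return State::S;
    if (s == State::IMA) return State::M;
    return s;
}
```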
4 - Case Studies: SGI Origin 2000
A flat memory-based directory protocol
Uses a bit vector directory representation
Consists of up to 512 nodes
Two processors per node, but there is no snooping protocol within a node; combining
multiple processors in a node reduces cost
Distinguishing Features
- For scalability, each directory entry contains fewer bits than necessary to represent
every possible cache that could be sharing a block.
- The directory dynamically chooses a coarse bit vector or a limited-pointer representation.
- Since the network provides no ordering, several new messages are used
for reordering purposes.
- The protocol handles all of these conditions without enforcing ordering in the network.
- Uses only two networks (request and response) to avoid deadlock. Note that the directory
has three types of messages (request, forwarded request, and response).
Figure 21. SGI Origin 2000 structure.
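The coarse bit vector mentioned above can be sketched as follows: with fewer bits than caches, each bit conservatively covers a whole group of caches. The group size and names here are illustrative assumptions, not Origin's actual parameters.

```cpp
#include <cassert>
#include <cstdint>

const int kCoarseness = 8;  // caches covered by each bit (assumed size)

// Record that cache_id may be sharing the block: set the bit for its
// group of kCoarseness caches.
uint32_t coarse_set(uint32_t vec, int cache_id) {
    return vec | (1u << (cache_id / kCoarseness));
}

// A set bit only means some cache in the group MAY share the block, so
// invalidations must go to the whole group; this imprecision is the
// price paid for the compact directory entry.
bool coarse_may_share(uint32_t vec, int cache_id) {
    return (vec >> (cache_id / kCoarseness)) & 1u;
}
```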
After finishing our study, we come up with some advantages and disadvantages as follows:
Advantages
Supports scalability
Able to take care of ordering messages
Disadvantages
More complicated than snooping
Many transactions are inefficient in time, as they require an extra message when the home
node is not the owner
High storage overhead of the directory data structure
Implementation plan: The simulator for the distributed-directories cache coherence protocol was
written in C++. We tried our best to map the concepts of the protocols to a real environment.
We made some assumptions for this implementation:
+ Number of cycles for executing every operation: 1 cycle
+ Number of directories = number of caches
Based on our understanding and analysis of the cache coherence problem, we came up with
the following evaluation metrics:
- Number of write backs vs. cache size and block size
- Number of write backs vs. cores
- Number of stalls vs. cores
- Number of cycles vs. cores
- Number of hits and misses vs. cores
We now show the implementation process together with the results that we get.
Implementation We first define the message and data types that we use, using typedef in C++.
The figure below shows the types of messages sent by the cache controller to the directory,
together with the state changes needed to maintain cache coherence.
The following explanation of the simulation code is for MOSI; the same applies to MSI, where
we just need to remove the O state.
Input data:
The information that the user needs to input includes:
+ Number of caches
+ Number of blocks
+ Number of requests
+ Number of cores
However, the number of cores needs to be set manually by copying the following code block:
//core 1
for (int j = 1; j < requestnum; j++){
    for (int i = 1; i < blocknum*cachenum; i += 2){
        state_t data2 = (state_t)(1 + rand() % 11);
        sta = translate_to_Readable_state(data2);
        state_t status = mapping_to_state(sta);
        MOSI_protocol_cache_request(LOAD, status);
    }
    for (int i = 2; i < blocknum*cachenum - 1; i += 2){
        state_t data2 = (state_t)(1 + rand() % 11);
        sta = translate_to_Readable_state(data2);
        state_t status = mapping_to_state(sta);
        MOSI_protocol_cache_request(STORE, status);
    }
}
The key components of our implementation are messages and states. Hereafter are the formats
of the messages and the states that we used in our implementation.
The messages from the directory contain information about LOAD, STORE, DATA, ACK, FWD
messages, etc.
typedef enum {
    NOP = 0,
    LOAD,                  // load message
    STORE,                 // store message
    DATA_FROM_DIR_NO_ACK,  // data message from directory to requestor
                           // without acknowledgement
    DATA_FROM_DIR_ACK,     // data from directory to requestor,
                           // including acknowledgement
    DATA_FROM_OWNER,       // data from owner to requestor
    DATA_FROM_NON_OWNER,   // data not from the owner
    ACK_COUNT,             // ack count
    INV,                   // invalidate message
    REPLACEMENT,           // replacement message
    FWD_GETS,              // forwarded GetS message from directory
    FWD_GETM,              // forwarded GetM message from directory
    PUT_ACK,               // Put-Ack
    INV_ACK,               // Inv-Ack
    LAST_INV_ACK           // last Inv-Ack
} messages;                // messages from directory
Meanwhile, the message sent from cache controller to directory is a bit simpler:
typedef enum {
    GETS = 1,
    GETM,
    PUTS,
    PUTM,
    PUTO,
    DDATA_FROM_OWNER,
    DDATA_FROM_NON_OWNER
} dmessages;               // messages to directory
Hereafter is the list of all possible states of the cache:
typedef enum {
    MOSI_M = 1,  // M state
    MOSI_O,      // O state
    MOSI_S,      // S state
    MOSI_I,      // I state
    MOSI_ISD,    // ISD state: transition from I to S, waiting for data
    MOSI_IMAD,   // IMAD state: transition from I to M, waiting for data and acks
    MOSI_IMA,    // IMA state
    MOSI_SMAD,   // SMAD state
    MOSI_SMA,    // SMA state
    MOSI_MIA,    // MIA state
    MOSI_OMAC,   // OMAC state
    MOSI_OMA,    // OMA state
    MOSI_OIA,    // OIA state
    MOSI_SIA,    // SIA state
    MOSI_IIA     // IIA state
} state_t;
Functions:
We use a number of functions as follows; the purpose of each function is given on the same line
as the function name with a line comment (//).
void MOSI_protocol_cache_request(messages, state_t);      // cache controller sends request to the directory
void MOSI_protocol_directory_request(dmessages, state_t); // directory responds upon receiving the message
void I_state_cache(messages, state_t);        // what the cache does when it is in I state
void Transition_I_to_SD(messages, state_t);   // transition from I to S, wait for data (D) to complete
void Transition_I_to_MAD(messages, state_t);  // transition from I to M, wait for data (D) and acknowledgement (A) to complete
void Transition_I_to_MA(messages, state_t);   // transition from I to M, wait for acknowledgement (A) to complete
void S_state_cache(messages, state_t);        // cache in S state
void Transition_S_to_MAD(messages, state_t);  // transition from S to M, wait for data (D) and acknowledgement (A) to complete
void Transition_S_to_MA(messages, state_t);   // transition from S to M, wait for acknowledgement (A) to complete
void M_state_cache(messages, state_t);        // cache in M state
void Transition_M_to_IA(messages, state_t);   // transition from M to I, wait for acknowledgement (A) to complete
A similar explanation applies to the rest of the functions:
void O_state_cache(messages, state_t);
void Transition_O_to_MAC(messages, state_t);
void Transition_O_to_MA(messages, state_t);
void Transition_O_to_IA(messages, state_t);
void Transition_S_to_IA(messages, state_t);
void Transition_I_to_IA(messages, state_t);
void X_state_cache(messages, state_t);
void directory_I(dmessages, state_t);
void directory_S(dmessages, state_t);
void directory_M(dmessages, state_t);
void directory_O(dmessages, state_t);
void send_GET_to_Directory(dmessages, state_t);
void send_Data_to_Requestor(messages, state_t);
void send_Data_to_Directory();
void checkOWNER(dmessages, state_t);
void checkGETS(state_t);
void checkGETM(dmessages, state_t);
void checkPUTS(dmessages, state_t);
void checkPUTM(dmessages, state_t);
void checkPUTO(dmessages, state_t);
void copy_Data_to_mem();
An example implementation of a MOSI cache request is shown in Figure 22.
Note that, depending on the state the cache is in, it acts on the received
message and performs the corresponding actions.
Figure 22. Cache request implementation on cache controller
In addition, Figure 23 shows the actions of a cache in state I. It checks the incoming request:
if the request is LOAD, it sends a GetS message to the directory to request a state change.
The directory then executes that request from the requestor, and so on.
Figure 23. Cache controller implementation when the cache is in I state
The following figure shows how a transition is implemented.
Figure 24. Cache controller implementation of a state transition
Then, the directory executes some functions to process the request; Figure 25
shows how this was implemented.
Figure 25. Directory response implementation
In order to verify the results, we need to manually test the states of the cache. For a test
that sends a request to change a block from state I to state S, the output of the program
should be as follows.
Figure 26. An example of debugging
As mentioned, we evaluate the system while varying different metrics and check the results. The
key problem of the simulation is that it generates different results depending on the randomly
generated input data. We are still figuring out how to fix that problem.
How to run the code: This project was developed in C++.
1 - Please download Microsoft Visual Studio to compile the project.
2 - The procedure for compiling the project is as follows:
a. Open Visual Studio -> select Open Project -> go to the directory of our project
Figure 26. Microsoft Visual Studio interface for opening a project
b. Select CAProject.sln
c. If you want to run the MOSI protocol simulation, select Solution Explorer (in the right
corner) -> Source Files -> right click -> Add -> Existing Item … -> navigate to
MOSI_protocol.cpp.
You can follow similar steps to open the MSI protocol. Note that you can run only one
protocol at a time.
d. Press Ctrl + Shift + B to compile the project
e. Press Ctrl + F5 to run the project. Input the numbers to run the project.
Important: if you want to change the number of cores, you have to copy the following code once
per core. This is inconvenient, but we have no solution for it right now.
// Core number
for (int j = 1; j < requestnum; j++){
    for (int i = 1; i < blocknum*cachenum; i += 2){
        state_t data2 = (state_t)(1 + rand() % 11);
        sta = translate_to_Readable_state(data2);
        state_t status = mapping_to_state(sta);
        MOSI_protocol_cache_request(LOAD, status);
    }
    for (int i = 2; i < blocknum*cachenum - 1; i += 2){
        state_t data2 = (state_t)(1 + rand() % 11);
        sta = translate_to_Readable_state(data2);
        state_t status = mapping_to_state(sta);
        MOSI_protocol_cache_request(STORE, status);
    }
}
Results & Analysis
1 - Number of write backs vs. cores:
We change the number of cores in the input and check the number of write backs made by MOSI
and MSI. Since the new O state reduces the number of write backs to memory, we achieve a
significant improvement compared with the baseline MSI.
Figure 27. Checking the number of write back when varying the number of cores
2 - Number of stalls vs. cores:
Adding more states means increasing the number of transitions between states. As mentioned,
stalls happen at the moments when a transition is still in progress. However, as we can see,
the increase in stalls is not too large. This is really a trade-off when selecting the
protocol: we accept more stalls, but we reduce the number of write backs to memory.
Figure 28. Checking the number of stall when varying the number of cores
3 - Number of cycles vs. cores:
We calculate the number of cycles to find out how many additional cycles the extra "O" state
requires. The results below show that MOSI spends more cycles than MSI. However, note that we
assume all operations are done in 1 cycle, which is not the case in reality: for example, a
write back should take about 100 cycles, while other operations should take 3-5 cycles.
Figure 29. Checking the number of cycles when varying the number of cores
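The uniform one-cycle assumption can be relaxed by weighting each operation type. The costs below (about 100 cycles per write back, 3-5 for everything else) are the rough figures suggested in the text, not measured values, and the function name is our own.

```cpp
#include <cassert>

// Weighted cycle estimate: write backs dominate, so counting them at
// their realistic cost changes the MOSI-vs-MSI comparison.
long weighted_cycles(long writebacks, long other_ops) {
    const long kWritebackCost = 100;
    const long kOtherCost = 4;  // midpoint of the 3-5 cycle estimate
    return writebacks * kWritebackCost + other_ops * kOtherCost;
}
```

Under this weighting, a protocol that trades a few extra stalls for fewer write backs (as MOSI does) can come out ahead even if its raw operation count is higher.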
More importantly, we also calculate the numbers of hits and misses obtained by applying the
MOSI and MSI protocols. The results show that MOSI also achieves better performance in terms
of hits and misses. This also reflects a positive side of adding the Owned state.
Figure 30. Checking the number of hits when varying the number of cores
Figure 31. Checking the number of misses when varying the number of cores
Summary:
+ We incur more stalls by adding the O state; however, we reduce the number of
write backs, which is more efficient for the system. Note that the simulator depends on the
input parameters, which leads to unpredictable results. The implementations of MOSI and MSI
were developed in C++. To run MESI, you need to use Java to compile the code; the
next section will cover this information.
Adding Exclusive State (MESI protocol)
Motivation: In MESI, a core takes only one coherence transaction to read and then write a
block, whereas in MSI, reading and then writing takes two coherence transactions.
Mechanism: If a GetS is requested by a core for a block that is not shared by any other
core, the requesting core can obtain the block in state E. The core can then silently upgrade
from state E to M without issuing another coherence request. As in snooping MESI, the question
arises whether to make E an owner state or not. In this protocol the solution is simple: E is
an owner state, so the cache with the block in E has to respond to requests for it. On
eviction, the cache has to send a PutE request to the directory, so that the directory knows
it is now the owner of that block and must respond to incoming requests.
If the block in state E were not considered owned, the protocol complexity would increase. For
example, consider three cores where the first core's block is in state E, and the directory
receives a GetS or GetM request from the second core. At that time, the directory only knows
that the first core's block is in one of E, M, or I. If it is in M, the directory forwards the
request to the first core, which then responds. If it is in E, either the first core or the
directory can respond, and if it is in I, only the directory can respond.
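The two MESI rules above (granting E for an unshared block, and the silent E-to-M upgrade) can be sketched as follows. This is illustrative C++ of our own, not the project's Java implementation.

```cpp
#include <cassert>

enum class State { I, S, E, M };

// On a GetS reply: if no other core shares the block, the requestor
// receives it in the Exclusive state instead of Shared.
State on_gets_reply(bool any_other_sharer) {
    return any_other_sharer ? State::S : State::E;
}

// A store hitting a block in E upgrades silently to M, with no new
// coherence request; the directory never learns about this upgrade,
// which is why it can only conclude the core is in E, M, or I.
State on_store(State s) {
    if (s == State::E) return State::M;
    return s;
}
```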
Diagrammatic representation of the mechanism explained above
This protocol is somewhat more complex than the MSI protocol, and the added complexity is at
the directory controller: in addition to more states, the directory controller must distinguish
between more possible events. One such complexity arises when a PutS arrives: the directory
must determine whether this PutS is the last PutS; if it is, the directory has to change the
block's state to I.
Implementation of MESI_DIRECTORY
The main point of the implementation is that for each block in memory there is a corresponding
directory entry. The directory maintains the state of each block and has to respond to every
request. Everything is checked with the directory for the block's state, whether it is I, E, M,
or S. Every time a cache sends a request to the directory, the centralized directory responds
based on the current state of the block. I implemented all the states with GetS and GetM in
the directory controller, and load and store in the cache controller.
Simulation of MESI_DIRECTORY
I implemented the code for the MESI directory protocol in Java. The parameter is the number of
cores; based on it, the numbers of data hits and misses are calculated. If the requestor gets
the data back, the hit count increases; otherwise (i.e., the block's state does not allow
responding to the request), it counts as a miss. We increment the cycle count each time the
loop executes, so we know how many times the loop iterates. The MESI_CACHE method in the code
deals entirely with the cache controller, and MESI_DIRECTORY deals entirely with the directory
controller.
Source code: Download and install the Java SE Development Kit from
o https://www.oracle.com/java/index.html
Download and install the Eclipse IDE for Java Developers from
https://eclipse.org/
Open Eclipse and import the provided code into Eclipse.
ADVANCED COMPUTER ARCHITECTURE
IMPLEMENTATION OF MESI_DIRECTORY PROTOCOL
Download the MESI_DIRECTORY_PROTOCOL_SOURCE_CODE folder, open the src folder, and
you will find the MESI_DIRECTORY3 file. Please execute it using Java 6 or above; the
parameter is the number of cores (a number), and you will get the totals of hits,
misses, and cycles.
To compile and run MESI_DIRECTORY3.java you can use the Eclipse IDE.
In the code files you will find MESI_DIRECTORY_CAHE and MESI_DIRECTORY, which
contain the procedures for each state; for example, in MESI_DIRECTORY the I-state
procedure is represented as I_state_directory(request), and in MESI_CACHE as
I_state_cache(request).
Step-by-step procedure to write and run Java code in Eclipse:
o Open Eclipse.
o From the top menu, click File -> New -> Java Project.
o Choose a name for your project and click the Finish button.
o Add the MESI_DIRECTORY3.java file.
o Right click on the class and run it as a Java application.
References
[1] Daniel J. Sorin, Mark D. Hill, and David A. Wood, "A Primer on Memory Consistency and
Cache Coherence," Morgan & Claypool Publishers, 2011.
[2] Linda Bigelow, Veynu Narasiman, and Aater Suleman, "An Evaluation of Snoop-Based Cache
Coherence Protocols."
[3] Anoop Tiwari, "Performance Comparison of Cache Coherence Protocol on Multi-Core
Architecture," Diss., 2014.
[4] Mu-Tien Chang, Shih-Lien Lu, and Bruce Jacob, "Impact of Cache Coherence Protocols on
the Power Consumption of STT-RAM-Based LLC."
[5] CMU 15-418: Parallel Computer Architecture and Programming, Lecture Series, Spring 2012.