Cache Coherence Simulation

CSCI 5593 – Advanced Computer Architecture – Spring 2015

Supervised By: Professor Gita Alaghband

Team 1 Members:

Shahab Helmi

Anh Nguyen

Phuc Nguyen

Harish Kodali

What is Cache Coherence?
Design Space of Cache Coherence Protocols
  States
  Events
  Transactions
  Invalidate vs. Update
Protocol Types
  Snooping protocols
  Directory protocols
  Hybrid protocols
Snooping Protocols
  MSI Protocol
  MESI Protocol
  MOSI Protocol
  Advanced Snooping protocols
A Case Study: Sun Starfire E10000
Snooping Simulation
  How to Run?
  Parameters
  Inputs
  Metrics
  Code
Results & Analysis
  Experiment 1: Number of invalidate messages In MSI & MESI
  Experiment 2:
  Experiment 3:
  Experiment 4:
Directory Protocols
  The baseline protocols:
  Adding Owned State (MOSI protocols)
Implementation plan:
Implementation
How to run the code:
Results & Analysis
  Adding Owned State (MOSI protocols)
Source code:
References

What is Cache Coherence?

Definition: “In computer science, cache coherence is the consistency of shared resource data that ends up stored in multiple local caches.” (Wikipedia)

Example: assume that the blue table represents the memory; the numbers on the left are the addresses of the memory blocks, and A, B, and so on are the values of the blocks. The green tables are three local caches. Each cache contains copies of some memory blocks. As figure 1.a shows, all copies of the same memory block have the same value (consistent shared data).

In figure 1.b, the value of the second block of cache 3 has been updated, but this change has not been propagated to the memory or to the other caches, which gives rise to the cache coherence problem.

Figure 1.a – Consistent shared data

Figure 1.b – Inconsistent shared data


Cache coherence protocols solve the cache coherence problem by keeping the data consistent among all caches and the memory.

Cache coherence protocols maintain coherence by enforcing the following invariant:

Single Writer, Multiple Readers (SWMR) invariant: for any memory location, at any given time, either exactly one core may write to it (and may also read it), or one or more cores may only read it.
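The invariant can be checked mechanically for a single memory location. The following Python sketch is illustrative only: the function name `satisfies_swmr` and the representation of permissions as a dictionary are assumptions made here, not part of our simulator.

```python
def satisfies_swmr(permissions):
    """Check the SWMR invariant for one memory location.

    `permissions` maps a core id to "RW" (read-write), "R" (read-only)
    or None (no access). This representation is hypothetical.
    """
    writers = [core for core, p in permissions.items() if p == "RW"]
    readers = [core for core, p in permissions.items() if p == "R"]
    if writers:
        # A writer must be alone: no second writer and no read-only copies.
        return len(writers) == 1 and not readers
    # Without a writer, any number of readers is allowed.
    return True
```

For example, `satisfies_swmr({0: "R", 1: "R"})` holds, while `satisfies_swmr({0: "RW", 1: "R"})` does not.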

Cache coherence protocols use finite state machines to implement the SWMR invariant. Each coherence controller implements a set of finite state machines, one per block (in all the sources we found, cache coherence protocols maintain coherence at the granularity of blocks, not bits).

Figure 2 – Cache coherence structure

As figure 2 shows, each cache controller has two sides:

1. Network side: interfaces to the rest of the system through the interconnection network. For example, if a cache miss occurs, the cache controller issues a coherence message to the other caches or to memory (depending on the protocol type) and receives the data. It can also send data to other cores or to memory.

2. Core side: interfaces to the core. It receives load and store requests from the core and sends data to, or receives data from, the core.

The memory controller is similar to the cache controller, except that it has no core side.


Design Space of Cache Coherence Protocols

Each protocol has 4 main elements.

States

The state of each cache block consists of the following properties:

Validity: a valid block has the most up-to-date value for this block. A valid block can be read, but it can be written only if it is also exclusive.

Dirtiness: a cache block is dirty if its value is the most up-to-date value and this value differs from the value in memory.

Exclusivity: a cache block is exclusive if it is the only privately cached copy of that block in the system.

Ownership: a cache controller (or memory controller) is the owner of a block if it is responsible for responding to coherence requests for that block. An owned block cannot be evicted without passing the ownership to another coherence controller. In most protocols, there is exactly one owner for each block.

Most protocols use a subset of the classic five-state MOESI model (pronounced "MO-zee"), whose states are called stable states. Each state has a different combination of the properties described above.

Modified: valid, exclusive, owned, and potentially dirty. The block may be read or written. The cache holds the only valid copy of the block and must respond to requests for it. The memory copy of the block is potentially stale.

Shared: valid, not exclusive, not dirty, and not owned. The cache has a read-only copy of the block. Other caches might have valid, read-only copies of the block.

Invalid: the block is invalid. The cache either does not contain the block or has a stale copy of it. The block may not be read or written.

Owned: valid, owned, and potentially dirty, but not exclusive. The cache has a read-only copy of the block and must respond to requests for it. The memory copy is potentially stale.

Exclusive: valid, exclusive, and not dirty. The cache has the only read-only copy of the block, and the memory copy of the block is up-to-date.
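The state descriptions above can be collected into a small lookup table. This is a sketch for illustration; the names `Props`, `MOESI`, `can_read` and `must_respond` are hypothetical and do not come from our simulator. Ownership of the Exclusive state is not stated above, so it is left unspecified here.

```python
from collections import namedtuple

Props = namedtuple("Props", "valid dirty exclusive owned")

# "dirty" is True where the text says "potentially dirty".
# Ownership of E is not stated in the text, so it is left as None.
MOESI = {
    "M": Props(valid=True, dirty=True, exclusive=True, owned=True),
    "O": Props(valid=True, dirty=True, exclusive=False, owned=True),
    "E": Props(valid=True, dirty=False, exclusive=True, owned=None),
    "S": Props(valid=True, dirty=False, exclusive=False, owned=False),
    "I": Props(valid=False, dirty=False, exclusive=False, owned=False),
}

def can_read(state):
    """A block may be read only while it is valid."""
    return MOESI[state].valid

def must_respond(state):
    """An owner is responsible for answering coherence requests."""
    return MOESI[state].owned is True
```

With this table, `must_respond("O")` is true and `must_respond("S")` is false, which matches the difference between the Owned and Shared states.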

Transient states occur during the transition from one stable state to another. A transient state is written XY^Z (or XYZ in plain text): the block is in transition from stable state X to stable state Y, and the transition will not complete until an event of type Z occurs. For example, IM^D (IMD) denotes a block that was in state “I” and will move to state “M” when the data (D) is received.

There are 2 general approaches to naming the states of blocks in memory. The choice of naming does NOT affect functionality or performance.

Cache-centric: the state of a block in memory is an aggregation of the states of that block in the caches. For example, if a block is in state “I” in all caches, the memory state for this block is “I”. If one or more copies are in “S”, then the block is in “S” in memory. If the block is in state “M” in one cache, it is in “M” in memory.

Memory-centric: the state of a block corresponds to the memory controller's permissions to this block. For example, if a block is in “I” in all caches, the memory state for it is “O”, because the memory behaves as its owner. If all copies are in “S”, the memory state is also “O”. If the block is in “M” or “O” in one cache, its memory state is “I”, since the memory copy is invalid.
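The two naming schemes can be written as two functions over the list of per-cache states of one block. This is a minimal sketch: the function names are hypothetical, and the handling of a mix of “I” and “S” copies in the memory-centric case is an assumption (the text only covers the all-“I” and all-“S” cases).

```python
def cache_centric(cache_states):
    """Memory state as an aggregation of the states in the caches."""
    if "M" in cache_states:
        return "M"
    if "S" in cache_states:
        return "S"
    return "I"

def memory_centric(cache_states):
    """Memory state as the memory controller's own permission."""
    # A cache holding the block in M or O is the owner, so the memory
    # copy is treated as invalid.
    if "M" in cache_states or "O" in cache_states:
        return "I"
    # Assumption: any mix of I and S copies leaves the memory as owner.
    return "O"
```

For the same situation (one cache in “M”), `cache_centric(["M", "I"])` gives “M” while `memory_centric(["M", "I"])` gives “I”; both describe identical behavior.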

The most common way to maintain the state of blocks in caches is to add a few extra bits to each block. For example, MOESI needs 3 bits to encode the state.

To maintain the state of blocks in memory, we can use the same approach. Alternatively, we can use logic gates. For example, as depicted in figure 3, a NOR gate could be used to determine the block state in memory.

Figure 3 – Using logical gates to determine states of the memory blocks


Events

Events are core requests to their cache controllers. Table 1 shows common events and their meanings:

Table 1 – Common coherence events


Transactions

Transactions are all initiated by cache controllers that are responding to requests from their

associated cores. Most protocols have a similar set of transactions, because the basic goals of the

coherence controllers are similar.

Table 2 – Common coherence transactions

Invalidate vs. Update

The other major design decision in a coherence protocol is to decide what to do when a core writes

to a block. There are two options:

Invalidate protocols: when a core wishes to write to a block, it initiates a coherence transaction to invalidate the copies in all other caches. Thus, if other cores want to read this block, they need to issue a new request to obtain a new copy of it.

Update protocols: when a core wishes to write to a block, it initiates a coherence transaction to update the copies in all other caches to reflect the new value it wrote to the block.

Tradeoffs:

Update protocols reduce the reading latency.

They use more bandwidth since their messages are bigger (carry data as well).
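The tradeoff can be made concrete with a toy message count. The sketch below assumes a deliberately simple scenario that is not taken from our simulator: one core writes a shared block several times, and afterwards every other sharer reads it once. The function names and the per-sharer message accounting are assumptions.

```python
def invalidate_traffic(other_sharers, writes):
    """Toy count for an invalidate protocol in the scenario above."""
    if writes == 0:
        return {"invalidations": 0, "data_messages": 0, "read_misses": 0}
    return {
        "invalidations": other_sharers,  # sent once; later writes hit locally
        "data_messages": other_sharers,  # each sharer must fetch a new copy
        "read_misses": other_sharers,
    }

def update_traffic(other_sharers, writes):
    """Toy count for an update protocol in the same scenario."""
    return {
        "invalidations": 0,
        "data_messages": other_sharers * writes,  # every write carries data
        "read_misses": 0,                         # sharers stay up to date
    }
```

With 3 other sharers and 5 writes, the invalidate variant sends 3 data messages but causes 3 read misses, while the update variant sends 15 data messages and causes no read misses.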


Protocol Types

Cache coherence protocols can be classified into 3 groups:

Snooping protocols

Idea: all coherence controllers observe (snoop) coherence requests in the same order. By

requiring that all requests to a given block arrive in order, a snooping system enables the distributed

coherence controllers to correctly update the finite state machines that collectively represent a

cache block’s state.

Traditional snooping protocols broadcast requests to all coherence controllers, including the

controller that initiated the request. The coherence requests typically travel on an ordered broadcast

network, such as a bus.

Directory protocols

Idea: a cache controller issues a coherence request for a block and unicasts it to the memory controller that is the home for that block. Each memory controller maintains a directory that holds state about each block in the memory.

Hybrid protocols

Hybrid protocols combine snooping and directory protocols. Snooping protocols are simpler to implement than directory protocols, but because of the large number of messages they exchange, they do not scale to large numbers of cores.

Hybrid protocols are designed to take advantage of the good properties of each type.


Snooping Protocols

Figure 4 shows the baseline system used in our project. Each core has its own cache and cache controller. Cache controllers can communicate with each other and with memory through the interconnection network. We assume that the interconnection network is in charge of ordering the coherence messages.

Figure 4 – the baseline system


MSI Protocol

MSI is the simplest snooping protocol and has just three states: Modified, Shared and Invalid. In the initialization phase, all cache blocks are in state I and all memory blocks are in state M (since the memory is the owner).

We assume that MSI implements two atomicity properties:

Atomic requests: a coherence request is ordered in the same cycle in which it is issued.

Atomic transactions: coherence transactions are atomic in that a subsequent request for the same block may not appear on the bus until after the first transaction completes (i.e., until after the response has appeared on the bus).

Figure 5 shows the state diagram of the MSI cache controller, and table 3 shows the corresponding table.

Figure 5 – MSI cache controller state diagram


Table 3 – MSI cache controller table

For example, if the cache holds a memory block in the invalid (I) state and the core sends a load request for that block, the cache controller issues a GetS request, broadcasts it to all other controllers, updates the cache block state to ISD, and waits (stalls) until the requested data is received. When the data is received, the cache controller updates the block state to S.
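The load example above can be traced with a small transition table. The sketch below covers only a handful of rows (the load path described above, the analogous store path through IMD, and plain hits); it is not a transcription of table 3, and the names `MSI_CACHE`, `step` and `run` are hypothetical.

```python
# (state, event) -> (action, next state); a partial, illustrative table.
MSI_CACHE = {
    ("I", "Load"): ("issue GetS", "ISD"),
    ("ISD", "Data"): ("copy data into cache", "S"),
    ("I", "Store"): ("issue GetM", "IMD"),
    ("IMD", "Data"): ("copy data into cache", "M"),
    ("S", "Load"): ("hit", "S"),
    ("M", "Load"): ("hit", "M"),
    ("M", "Store"): ("hit", "M"),
}

def step(state, event):
    """Return (action, next_state) for one event on one block."""
    if (state, event) in MSI_CACHE:
        return MSI_CACHE[(state, event)]
    # Core requests that arrive while the block is in a transient state stall.
    if state in ("ISD", "IMD") and event in ("Load", "Store"):
        return ("stall", state)
    raise KeyError(f"transition not modeled: {state}, {event}")

def run(state, events):
    """Apply a sequence of events and return the final state and a trace."""
    trace = []
    for event in events:
        action, state = step(state, event)
        trace.append((event, action, state))
    return state, trace
```

Calling `run("I", ["Load", "Load", "Data"])` issues a GetS, stalls the second load while the block is in ISD, and ends in state S once the data arrives.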

Cells marked with (A) are forbidden because of the atomic transactions property.

Figure 6 and table 4 show the state diagram and the corresponding table of the MSI memory controller, respectively.

Figure 6 – MSI memory controller state diagram


Table 4 – MSI memory controller table

Advantages:

Small table and few possible states.

Easy to understand and implement.

Multiple copies of the same block can exist because of the shared state.

Disadvantages:

Many impossible states due to the atomic transactions property, and many stalls:

Lower throughput

Higher latency

Unnecessary broadcast of invalidate messages: when a core wants to write to a block, it must get the block in state M and send an invalidate message to all other cores, regardless of whether it holds the only copy of that block.

Tradeoffs: downgrade from M to S or to I? We need to predict whether the block is going to be used again.


MESI Protocol

MESI implements the atomic transactions property but not the atomic requests property.

The Exclusive state is used in almost all commercial coherence protocols because it optimizes a common case: a core first reads a block and subsequently writes it.

In MSI, a core needs to issue a GetS message to get read permission (in case of a cache miss) and then has to issue a GetM message to get write permission.

In MESI, a core can obtain the block in the Exclusive state when no other cache holds a copy. Thus, the core does not need to issue a GetM message before writing.
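The read-then-write case can be expressed as a count of coherence requests. This is a sketch under simplifying assumptions (a load miss followed by a store to the same block, with no intervening requests from other cores); the function name is hypothetical.

```python
def requests_for_read_then_write(protocol, other_sharers):
    """Coherence requests issued for a load miss followed by a store."""
    if protocol == "MSI":
        # Read permission first, then a separate upgrade to write permission.
        return ["GetS", "GetM"]
    if protocol == "MESI":
        if other_sharers == 0:
            # The block arrives in E; the later E -> M upgrade is silent.
            return ["GetS"]
        # With other sharers the block arrives in S, as in MSI.
        return ["GetS", "GetM"]
    raise ValueError(f"unknown protocol: {protocol}")
```

The saving only appears when the block is not shared: `requests_for_read_then_write("MESI", 0)` returns a single request, while the MSI variant always returns two.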

Figure 7 and table 5 show the MESI cache controller state diagram and the corresponding table, respectively.

Figure 7 – MESI cache controller state diagram


Table 5 – MESI cache controller table

As mentioned, we assume that MESI does not implement the atomic requests property. Thus, ISAD means that the cache controller has issued a GetS request to move the cache block from the Invalid state to the Shared state, but the request has not been ordered yet. As soon as the request is ordered, the state is updated to ISD and the controller waits to receive the requested data. Finally, when the data is received, the state is updated to Shared.

Figure 8 and table 6 show the state diagram and the corresponding table of the MESI memory controller, respectively.

Figure 8 – MESI memory controller state diagram

Table 6 – MESI memory controller table

Advantages:

Silent transition from the exclusive state to the modified/shared state. No unnecessary invalidate messages are issued.

Read and write by issuing only one request.

Fewer messages.

Less traffic on the bus and lower bandwidth usage.

Disadvantages:

Extra hardware is needed to implement the exclusive state.

MOSI Protocol

When a cache has a block in state M or E and receives a GetS from another core, under the MSI or MESI protocol the cache must

change the block state from M or E to S

send the data to both the requestor and the memory controller

This raises the question of how a snooping protocol can minimize or eliminate these accesses to memory:

The extra data message that updates the memory when a cache in state M (or E) receives a GetS request

The potentially unnecessary write to the memory

The key difference: when a cache with a block in state M receives a GetS from another core, in a MOSI protocol the cache changes the block state to Owned (instead of S) and retains ownership of the block (instead of transferring ownership to the memory). The O state enables the cache to avoid updating the memory. Thus, when a controller requests a block and one of the caches holds that block in state O, that cache sends the block to the requestor and there is no need to load it from the memory.
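The difference can be summarized in one function that returns the next state and the data messages sent when a cache observes another core's GetS. This is an illustrative sketch, not code from our simulator; the function name and the dictionary layout are assumptions.

```python
def on_other_gets(protocol, state):
    """Reaction of a cache holding a block in `state` to another core's GetS."""
    if protocol == "MOSI" and state in ("M", "O"):
        # Keep ownership: supply the data, but do not update the memory.
        return {"next": "O", "data_to": ["requestor"]}
    if (protocol in ("MSI", "MESI") and state == "M") or \
       (protocol == "MESI" and state == "E"):
        # Downgrade to S and hand ownership back to the memory.
        return {"next": "S", "data_to": ["requestor", "memory"]}
    raise ValueError(f"case not modeled: {protocol}, {state}")
```

Under MSI the call `on_other_gets("MSI", "M")` produces two data messages, while `on_other_gets("MOSI", "M")` produces one and leaves the block in O.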

Figure 9 and table 7 show the MOSI cache controller state diagram and the corresponding table, respectively.

Figure 9 – MOSI cache controller state diagram


Table 7 – MOSI cache controller table

Differences with the MESI protocol are shown in red.

Figure 10 and table 8 show the state diagram and the corresponding table of the MOSI memory controller, respectively.

Figure 10 – MOSI memory controller state diagram

Table 8 – MOSI memory controller table

Advanced Snooping protocols

Pipelined (non-atomic) bus: multiple coherence requests can be sent without waiting for their responses. However, responses must be received in request order. For example, as illustrated in figure 11, the data response of request 1 must be received before the data response of request 2.

Figure 11 – Pipelined (non-atomic) bus

Split-transaction (non-atomic) bus: responses may be provided in an order different from the request order. This increases the system throughput, since controllers do not need to wait for each other (see figure 12).

Figure 12 – Split Transaction (non-atomic) bus

For more information about the MSI protocol with a split-transaction bus, please refer to our slides.


A Case Study: Sun Starfire E10000

Uses MOESI.

Non-atomic requests and transactions.

Supports up to 64 processors.

Wired snooping buses consume a lot of energy; thus, they do not scale to large numbers of cores. To solve this problem, the E10000 uses point-to-point links instead.

Uses a separate bus for sending out-of-order data response messages.

In this system, all controllers unicast their coherence requests to the root node, and the root node orders the requests and broadcasts them to all other controllers. Figure 13 shows a high-level architecture of this system.
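The ordering role of the root node can be sketched with a queue. The class below is a hypothetical model written for this report, not a description of the actual E10000 hardware. As a simplifying assumption, it delivers each ordered request to every controller, including the sender, so that all logs can be compared directly.

```python
from collections import deque

class RootNode:
    """Hypothetical root node that serializes coherence requests."""

    def __init__(self, num_controllers):
        self.pending = deque()
        # One observation log per controller, standing in for its snoop port.
        self.logs = [[] for _ in range(num_controllers)]

    def unicast(self, sender, request):
        # Point-to-point link from a controller to the root.
        self.pending.append((sender, request))

    def broadcast_all(self):
        # Requests leave the root in arrival order, so every controller
        # observes the same total order.
        while self.pending:
            message = self.pending.popleft()
            for log in self.logs:
                log.append(message)
```

After two controllers unicast a request each and `broadcast_all()` runs, all logs hold the same two entries in the same order, which is the property a snooping protocol relies on.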

Figure 13 – Sun Starfire E10000 message passing


Snooping Simulation

We implemented 3 different cache coherence simulators: one for snooping protocols in C#, and 2 for directory protocols in Java and C++, according to our skills in each programming language. We tried our best to follow the same algorithms and to use the same parameters, inputs and so on (especially in the C# and Java versions). We did not have enough time to match everything, but the main elements of the simulators are the same.

How to Run?

If you would like to view and edit the code, you will need Microsoft Visual Studio 2013. Follow the path below:

Team 1 Final Report (Root Folder) Code Snooping Simulator Snooping CSCI5593

If you would like to execute the code, you will need .NET Framework 4.5. Follow the path below:

Team 1 Final Report (Root Folder) Code Snooping Simulator Snooping.exe

Parameters

You could observe a screenshot of the application (input entry) in figure 14.

Hardware Parameters:

Number of cores

Cache and memory latency (cycles)

Memory and cache size (number of blocks): we assume that memory and cache blocks have the same block size. Thus, we only need the number of blocks to calculate the memory and cache sizes.

Input Parameters:

Input size (number of load/store requests for each core): for each core we generate a series of load/store instructions. For example, if the input size is 100, 100 load/store instructions are generated for each core.

Store percentage (distribution of load vs. store requests): for example, if it is set to 40, 40% of the requests are stores and 60% are loads.

Larger input size / smaller memory -> higher probability that cores need a block at the same time

Larger store percentage -> more conflicts -> more stalls


Figure 14 – Parameter entry

Inputs

We tested our simulator using 12 different inputs to analyze the behavior of each protocol under different conditions. We got the idea of using different inputs from SPLASH-2, which is a well-known and widely used benchmark suite. To ease testing, you can simply choose one of the inputs from the Test Name drop-down list and the values will be loaded automatically into the fields. The specifications of each input are given in tables 9a and 9b.

Table 9a – Input L1 to H3

As you can observe, the inputs have the same values in all fields and differ only in input size and store percentage. This helps us see the effect of a single property on each protocol. L, M and H stand for Light, Medium and Heavy, respectively, and refer to the size of the input.


Table 9b – Input MC1 to MW3

Metrics

We keep track of the metrics below in our application. Since we did not have enough time to analyze all of them one by one, our analysis focuses on the number of write-backs and the number of invalidate messages.

Per core:

Write-backs

Memory reads

Invalidate messages

Coherence messages (broadcasting between cores)

Memory messages (coherence messages sent to the memory)

Data responses

Stalls

Cache hits

Cache misses

Replacements (evictions): when cache is full

Per Protocol:

Write-backs

Invalidate messages

Coherence messages (broadcasting between cores)

Memory messages (coherence messages sent to the memory)


All messages

Memory references (read/write from memory)

Stalls

Cache hits

Cache misses

Code

In this section, some high level information about the code is presented.

Input Folder:

This folder contains 4 classes:

InputElement.cs: this class defines the data structure of our input elements. Each input element has 3 fields: _Command, _Core, and _BlockID

Example: Core #2 needs to load the block which is originally located in the 25th block of memory (its copies could be contained in caches!)

public string _Command = "Load";
public int _Core = 2;
public int _BlockID = 25;

Generator.cs

This class generates the input (input elements) according to the input size and store

percentage parameters.

InputTest.cs

Generates an input using the Generator.cs class and prints the values. A sample generated by this class (command, block ID, core):

Load 3850 2
Load 5207 6
Load 7230 4
Store 4374 3
Store 5998 5
Load 4247 3
Store 7729 1
Load 1040 0
Store 862 2
Store 4738 4
Load 2152 7
Load 6976 1
Store 8759 6
Store 8347 3
Load 7171 0
..

Tests.cs

Includes 12 predefined configurations, as mentioned in the input section, that we used to test

our simulator.
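The input generation described for Generator.cs can be sketched as follows. This Python version is a hypothetical re-creation, not a port of the C# class: the function name, the `seed` parameter, the uniform choice of block IDs and the final shuffle that interleaves the cores are all assumptions.

```python
import random

def generate_input(num_cores, input_size, store_percentage, memory_blocks, seed=None):
    """Return a list of (command, block_id, core) tuples.

    Each core gets `input_size` requests; roughly `store_percentage`
    percent of them are stores and the rest are loads.
    """
    rng = random.Random(seed)
    elements = []
    for core in range(num_cores):
        for _ in range(input_size):
            command = "Store" if rng.random() * 100 < store_percentage else "Load"
            block_id = rng.randrange(memory_blocks)  # assumed uniform over memory
            elements.append((command, block_id, core))
    rng.shuffle(elements)  # assumed: requests of different cores are interleaved
    return elements
```

For instance, `generate_input(8, 100, 40, 10000, seed=1)` yields 800 elements, 100 per core, in the same (command, block ID, core) layout as the sample above.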


Cores Folder:

Core.cs

We use this class for keeping track of our metrics for each core, such as the number of write-

backs, invalidate messages and so on.

CacheBlock.cs

Each cache block has 3 fields:

MBlockID: the address of the cache block.

State: shows the state of the block in cache: M, S, E, O, I, or X for empty

Dirty: indicates if the block is dirty or not.

Protocols Folder:

MSI.cs, MESI.cs, and MOSI.cs contain the implementation of each protocol. Some common methods of these classes are:

Load: loads a block into the cache.

Evicts: if the cache is full, this method chooses a block and evicts it.

UpdateState: updates the state of a block. For example: M -> S

OWNE(BlockID = 1, Core = 2): returns true if cache 2 has a copy of the memory block with id = 1 in the E state.

Two other classes are just some data structures that are used in the code.

Statistics Folder:

Contains two classes Sumarry.cs and ProtocolStatistics.cs.

These classes calculate the number of messages, cache hits and so on for each protocol.

For example:

Cache Hits = cache hits of cache 1 + cache hits of cache 2 + …


Results & Analysis

We found that the output results are highly dependent on the input, so it is possible to see different results using the same input configuration. Thus, we executed each test around 10 times and computed the average values. However, for the H (heavy) inputs the runtime is very long (even hours!), so we ran them only 1-2 times.

Experiment 1: Number of invalidate messages In MSI & MESI

Motivation: as mentioned in previous sections, one of the main design goals of the MESI protocol is to reduce the number of invalidate messages. The reason for this reduction is that when a cache holds a block in state E, it does not need to broadcast invalidate messages when the block's state is upgraded to M. As the results in figure 15 show, in all cases except L1, MESI broadcasts fewer invalidate messages.

We first thought that the L1 result was caused by a bug in our code, but when we checked the MESI state table, we noticed that it broadcasts an invalidate message when a block downgrades from E to I or S (on receiving a GetS request from another core). Thus, there are some invalidate messages even when there is no write conflict in the input.

Figure 15 – Number of Invalidates in MSI vs. MESI


Experiment 2: Number of write-backs in MSI & MOSI

Motivation: as mentioned in previous sections, the design goal of the MOSI protocol is to reduce the number of write-backs to memory compared to MSI. The reason is that when a cache holding a block in state M receives a GetS request for that block from another core, it moves the block to state O and remains the owner (instead of downgrading to S), so there is no need to write the block back to memory. As figure 16 shows, the number of write-backs in MOSI was always lower than in MSI. However, the difference was not significant.

Figure 16 – Number of write-backs in MSI vs. MOSI


Experiment 3: Effect of Cache Size on the Number of Write-Backs

Motivation: as we studied in this course, increasing the cache size should reduce the number of evictions and write-backs, since the cache can hold more memory blocks and exploit locality better. We tested this using three different inputs, multiplying the cache size by 10 from each one to the next: in MC1, MC2 and MC3, each cache has 10, 100 and 1000 blocks, respectively. As figure 17 shows, the number of write-backs decreases as the cache size increases.
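The expected trend can be reproduced with a single-cache toy model that counts evictions for different cache sizes. The sketch below assumes LRU replacement and a uniformly random block trace; neither is taken from our simulator, and the function names are hypothetical. It counts all evictions, whereas a write-back happens only when the evicted block is dirty.

```python
import random
from collections import OrderedDict

def random_trace(length, memory_blocks, seed=0):
    """Uniformly random block IDs; an assumed stand-in for a real input."""
    rng = random.Random(seed)
    return [rng.randrange(memory_blocks) for _ in range(length)]

def count_evictions(trace, cache_blocks):
    """Count evictions of an LRU cache holding `cache_blocks` blocks."""
    cache = OrderedDict()  # keys are block IDs, ordered from least to most recent
    evictions = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)  # hit: mark as most recently used
            continue
        if len(cache) >= cache_blocks:
            cache.popitem(last=False)  # miss with a full cache: evict the LRU block
            evictions += 1
        cache[block] = True
    return evictions
```

Running `count_evictions(random_trace(5000, 2000), n)` for n = 10, 100 and 1000 gives a non-increasing number of evictions, in line with the trend reported for MC1 to MC3.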

Figure 17 – Write-back reduction vs cache size


Experiment 4: Write-back Reduction vs. Number of Cores

Motivation: as mentioned in previous sections, the design goal of the MOSI protocol is to reduce the number of write-backs to memory compared to MSI. We expected that as we increase the number of cores, MOSI would become more and more effective at reducing the number of write-backs compared to MSI.

To see this effect, we normalized the values. Contrary to our expectation, the ratio of the number of write-backs stayed almost the same as we increased the number of cores (see figure 18). The exact reason for this is not clear to us. It could be a bug in our code, or our assumption could be wrong.

Figure 18 – Write-back reduction vs # of cores


Directory Protocols

The baseline protocols:

The key role of the directory is to maintain the state and data of every block in every cache. To do that, the directory needs to listen to all the requests made by the cache controllers, process them, and send messages back to the cache controllers to ask for action. For example, a cache controller that wants to issue a coherence request sends a GetS message directly to the directory using the unicast (1-to-1) method; the directory then looks up the state of the block to determine what actions to take next. If the directory state indicates that the requested block is owned by the cache of core A, the directory forwards the request to that cache, which responds to the requester. Note that all communication is done by the unicast (1-to-1) method. The infrastructure of a directory-based cache coherence system is shown in Figure 14. There are multiple cores, each with a corresponding cache controller. All cache controllers connect to the interconnection network and send their requests to the directory component located at the memory.

Figure 14. The outline framework of directory-based cache coherence protocol.

We now describe the baseline system to give some basic information about how the directory

coherence work with three traditional state of cache: MSI. The blocks state would be changed from

one state to another. The algorithm for changing that state are as following:

From I to S: the cache controller sends a GetS request message to the directory and changes the block state from I to ISD (the transient state that waits for data). In the first case, the directory is the owner of the block: on receiving the request, it sends back a data message and changes its own state for the block to S. When the data arrives at the requestor, the cache controller changes the block's state to S, and the transaction is complete. In the second case, the directory is not the owner of the block: it forwards the request to the owner and changes the block's state to the transient state SD. The owner responds to this Fwd-GetS by sending the data to the requestor and changing its block's state to S. The previous owner must also send the data to the directory, since the directory now becomes the owner. When the data arrives at the directory, the directory copies it to memory, changes the block state to S, and considers the transaction complete. Figure … illustrates the above explanation.
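Seen from the requesting cache controller, the I to S transition is a small state machine. The sketch below is only an illustration under assumed names (`CacheState`, `CacheEvent`, `onEvent`); it models the two events mentioned in the text and nothing else.

```cpp
#include <cassert>

// Hypothetical names; only the load path from I to S is modeled.
enum class CacheState { I, IS_D, S };
enum class CacheEvent { Load, DataArrived };

struct Step {
    CacheState next;  // state after the event
    bool sendGetS;    // true if a GetS must be sent to the directory
};

Step onEvent(CacheState s, CacheEvent e) {
    // Data without an outstanding request would be a protocol error.
    assert(!(s == CacheState::I && e == CacheEvent::DataArrived));
    if (s == CacheState::I && e == CacheEvent::Load)
        return {CacheState::IS_D, true};   // miss: request the block, wait for data
    if (s == CacheState::IS_D && e == CacheEvent::DataArrived)
        return {CacheState::S, false};     // data arrived: transaction complete
    return {s, false};                     // e.g. a load hit in S changes nothing
}
```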

Figure 15. Possible transitions from state I to state S

From I or S to M:

The cache controller sends a GetM request to the directory and changes the block's state to IMAD (or SMAD when starting from S). If the directory is the owner of the requested block, it sends the data back to the requestor immediately. If the directory does not own the requested block, it forwards the request to the node that owns the data, and the owner then sends the data to the requestor. The figure below illustrates the two cases of changing state from I or S to M.

Figure 16. Possible transitions from state I to state M

From M to I:

The cache controller first sends a PutM request, which includes the data to be written back, and changes the block state to MIA. When the directory receives this PutM, it updates memory and responds with a Put-Ack message. Once the requestor receives the Put-Ack, the block's state changes to I. As in the previous cases, the cache controller may receive a forwarded coherence request (Fwd-GetS or Fwd-GetM) between sending the PutM and receiving the Put-Ack; in that case it responds to the request and changes its block's state to SIA or IIA, respectively. The figure below illustrates the messages and state changes.
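The eviction path, including the race with a forwarded request, can be written as a transition function. This is a hedged sketch with assumed names (`EvictState`, `EvictEvent`, `evictStep`); it covers only the transitions named in the text.

```cpp
#include <cassert>

// Hypothetical names; states follow the text (MIA, SIA, IIA written with underscores).
enum class EvictState { M, MI_A, SI_A, II_A, I };
enum class EvictEvent { Replacement, FwdGetS, FwdGetM, PutAck };

EvictState evictStep(EvictState s, EvictEvent e) {
    switch (s) {
    case EvictState::M:
        if (e == EvictEvent::Replacement) return EvictState::MI_A;  // PutM with data sent
        break;
    case EvictState::MI_A:
        if (e == EvictEvent::FwdGetS) return EvictState::SI_A;  // respond, keep waiting
        if (e == EvictEvent::FwdGetM) return EvictState::II_A;  // respond, keep waiting
        if (e == EvictEvent::PutAck) return EvictState::I;      // eviction finished
        break;
    case EvictState::SI_A:
    case EvictState::II_A:
        if (e == EvictEvent::PutAck) return EvictState::I;
        break;
    case EvictState::I:
        assert(e != EvictEvent::PutAck);  // no eviction is pending in state I
        break;
    }
    return s;  // events not described in the text leave the state unchanged
}
```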

Figure 17. Possible transitions from state M to state I

From S to I:

To replace a block in state S, the cache controller sends a PutS message and waits for the Put-Ack message from the directory. Once the requestor receives the Put-Ack, it changes the block's state to I.

Figure 18. Possible transitions from state S to state I

Adding Owned State (MOSI protocols)

1 - Overview

In this section, we present the conceptual idea of the MOSI directory-based protocol together with the details of its implementation. As in the snooping protocols, a block in the added Owned (O) state is valid, read-only, dirty (it must eventually update memory), and owned (the cache must respond to coherence requests for the block). Compared with the MSI directory-based protocol, adding the Owned state creates the following main changes:

1. More coherence requests are satisfied by caches (in state O) than by the LLC/memory.

2. There are more 3-hop transactions, that is, transactions with an intermediate forwarding step between the requestor and the responder.

The key difference compared with the MSI baseline protocol is the transaction in which a requestor of a block in state I or S sends a GetM to the directory when the block is in state O in the owner cache and in state S in one or more sharer caches. In this case, the directory forwards the GetM to the owner and appends the AckCount. The owner receives the Fwd-GetM and responds to the requestor with Data and the AckCount. Moreover, this protocol has a PutO transaction that is nearly identical to the PutM transaction. It contains data for the same reason that the PutM transaction contains data, i.e., because both M and O are dirty states. The figures below show the additional states and messages introduced by adding the O state. Table 9 shows the key differences between MSI and MOSI; the states and actions added with the O state are highlighted in yellow.
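On the requestor side, the GetM transaction finishes only when the Data message has arrived and as many Inv-Acks as announced in the AckCount have been collected. The bookkeeping can be sketched as follows; the struct `PendingGetM` and its members are assumptions for illustration, and the sketch allows Inv-Acks to arrive before the Data message.

```cpp
#include <cassert>

// Hypothetical bookkeeping for one outstanding GetM at the requestor.
struct PendingGetM {
    bool dataArrived = false;
    int acksExpected = 0;  // meaningful only once dataArrived is true
    int acksReceived = 0;

    // Data message from the owner (or directory), carrying the AckCount.
    void onData(int ackCount) {
        assert(ackCount >= 0 && !dataArrived);
        dataArrived = true;
        acksExpected = ackCount;
    }

    // One Inv-Ack from a sharer; it may arrive before the Data message.
    void onInvAck() { ++acksReceived; }

    // The block may enter state M only when this returns true.
    bool complete() const { return dataArrived && acksReceived == acksExpected; }
};
```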

Figure 19. Possible transitions from state I to state S

Table 9. Possible transitions and states of the MOSI protocol. The yellow cells mark the states that are available only in MOSI, not in the MSI protocol.

2 - Distributed Directories. One approach to improving the performance of a directory protocol is to avoid the bottleneck that arises when all the requests from the cache controllers go to a single directory: the directory is distributed, so that different blocks are handled by different directories. We use this approach for our implementation.
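With distributed directories, each block needs a fixed home directory so that every request for that block goes to the same place. A minimal sketch of one possible mapping is shown below. The function name `homeDirectory` and the interleaving by block address are assumptions for illustration; the text above does not state which mapping our simulator uses.

```cpp
#include <cassert>

// Hypothetical mapping: consecutive blocks are interleaved across the directories.
int homeDirectory(long blockAddress, int numDirectories) {
    assert(numDirectories > 0 && blockAddress >= 0);
    return static_cast<int>(blockAddress % numDirectories);
}
```

With 4 directories, blocks 8, 9, 10, and 11 map to directories 0, 1, 2, and 3, so requests for neighboring blocks are spread over all directories instead of queuing at one.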

Figure 20. The outline of our implementation of cache coherence protocols.

3 - Non-Stalling Directory Protocols:

We recall that one limitation of the directory protocols described so far is that a cache controller frequently stalls when it receives a forwarded request for a block in a transient state. A non-stalling protocol processes such requests instead. For example, when a cache controller has a block in state IMA and receives a Fwd-GetS, it processes the request and changes the block's state to IMAS. This state indicates that after the cache controller's GetM transaction completes (i.e., when the last Inv-Ack arrives), the cache controller will change the block state to S. The cache controller must also send the block to the requestor of the GetS and to the directory, which is now the owner. By not stalling on the Fwd-GetS, the cache controller can improve performance by continuing to process other forwarded requests behind that Fwd-GetS in its incoming queue.
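The non-stalling behavior amounts to remembering the forwarded request and serving it when the last Inv-Ack arrives. The sketch below illustrates this under assumed names (`NonStallingController`, `deferredRequestor`); messages are represented as plain strings for readability.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical names; IM_A and IM_A_S stand for the states IMA and IMAS in the text.
enum class TransState { IM_A, IM_A_S, M, S };

struct NonStallingController {
    TransState state = TransState::IM_A;
    int deferredRequestor = -1;  // core whose Fwd-GetS is served later

    // Instead of stalling, remember the requestor and move to IM_A_S.
    void onFwdGetS(int requestor) {
        assert(state == TransState::IM_A);
        deferredRequestor = requestor;
        state = TransState::IM_A_S;
    }

    // The GetM completes; returns the messages that must now be sent.
    std::vector<std::string> onLastInvAck() {
        if (state == TransState::IM_A) {
            state = TransState::M;  // no deferred request: plain completion
            return {};
        }
        assert(state == TransState::IM_A_S);
        state = TransState::S;      // the block is passed on right after the write
        return {"Data to core " + std::to_string(deferredRequestor),
                std::string("Data to directory")};
    }
};
```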

4 - Case Study: SGI Origin 2000

Flat memory-based directory protocol

Uses a bit vector directory representation

Consists of 512 nodes

Two processors per node, but there is no snooping protocol within a node; combining multiple processors in a node reduces cost

Distinguishing Features

- For scalability, each directory entry contains fewer bits than necessary to represent every possible cache that could be sharing a block.

- The directory dynamically chooses between a coarse bit vector and a limited pointer representation.

- Since the network provides no ordering, several new messages are used to handle reordering.

- The protocol accounts for all of these reordering conditions itself, since ordering is not enforced in the network.

- It uses only two networks (request and response) to avoid deadlock. Note that a directory protocol has three types of messages (request, forwarded request, and response).

Figure 21. SGI Origin 2000 structure.
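To make the coarse bit vector idea concrete, the sketch below shows how a directory entry with fewer bits than nodes can still track sharers conservatively: each bit stands for a group of nodes. The struct `CoarseVector` and the 64-bit entry width are assumptions for illustration and are not taken from the machine's actual directory format.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical coarse bit vector: one bit covers a group of `nodesPerBit` nodes.
struct CoarseVector {
    std::uint64_t bits = 0;
    int nodesPerBit;

    explicit CoarseVector(int totalNodes) : nodesPerBit((totalNodes + 63) / 64) {
        assert(totalNodes > 0);
    }

    // Record that `node` holds a copy by setting the bit of its group.
    void addSharer(int node) { bits |= std::uint64_t{1} << (node / nodesPerBit); }

    // True if `node` may hold a copy; other nodes in the same group give false positives.
    bool mayShare(int node) const { return ((bits >> (node / nodesPerBit)) & 1u) != 0; }
};
```

The price of the smaller entry is imprecision: an invalidation has to be sent to every node in a marked group, even if only one of them really shares the block.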

After finishing our study, we identified the following advantages and disadvantages of directory protocols:

Advantages

Supports scalability

Able to handle message ordering

Disadvantages

More complicated than snooping

Has many transactions, which cost extra time because they require an extra message when the home is not the owner

High storage overhead of the directory data structure

Implementation plan: The simulator was written in C++ for a distributed-directory cache coherence protocol. We tried our best to map the concepts of the protocols to a real environment. We made the following assumptions for this implementation:

+ Number of cycles for executing every operation: 1 cycle

+ Number of directories = number of caches

Based on our understanding and analysis of the cache coherence problem, we chose the following evaluation metrics:

- Number of write backs vs. cache size and block size

- Number of write backs vs. cores

- Number of stalls vs. cores

- Number of cycles vs. cores

- Number of hits and misses vs. cores

We now show the implementation process together with the results that we obtained.

Implementation: We first define the message and data types that we will use, using typedef in C++. The figure below shows the types of messages sent by the cache controller to the directory, together with the state changes they cause to maintain cache coherence.

The following explanation of the simulation code is for MOSI; the same applies to MSI, where we only need to remove the O state.

Input data:

The information that the user needs to input includes:

+ Number of caches

+ Number of blocks

+ Number of requests

+ Number of cores

However, the number of cores has to be set manually by copying the following code block once per core:

//core 1
for (int j = 1; j < requestnum; j++){
    // odd indices: issue a LOAD for a randomly chosen state
    for (int i = 1; i < blocknum*cachenum; i += 2){
        state_t data2 = (state_t)(1 + rand() % 11);
        sta = translate_to_Readable_state(data2);
        state_t status = mapping_to_state(sta);
        MOSI_protocol_cache_request(LOAD, status);
    }
    // even indices: issue a STORE for a randomly chosen state
    for (int i = 2; i < blocknum*cachenum - 1; i += 2){
        state_t data2 = (state_t)(1 + rand() % 11);
        sta = translate_to_Readable_state(data2);
        state_t status = mapping_to_state(sta);
        MOSI_protocol_cache_request(STORE, status);
    }
}

The key components of our implementation are messages and states. The formats of the messages and the states that we used are given below.

The messages from the directory carry information about LOAD, STORE, DATA, ACK, FWD messages, …

typedef enum {
    NOP = 0,
    LOAD,                  // load message
    STORE,                 // store message
    DATA_FROM_DIR_NO_ACK,  // data message from directory to requestor, without acknowledgement
    DATA_FROM_DIR_ACK,     // data from directory to requestor, including acknowledgement
    DATA_FROM_OWNER,       // data from owner to requestor
    DATA_FROM_NON_OWNER,   // data not from the owner
    ACK_COUNT,             // ack count
    INV,                   // invalidation message
    REPLACEMENT,           // replacement message
    FWD_GETS,              // forwarded GetS message from directory
    FWD_GETM,              // forwarded GetM message from directory
    PUT_ACK,               // Put-Ack
    INV_ACK,               // invalidation ack
    LAST_INV_ACK           // last invalidation ack
} messages; // message from directory

Meanwhile, the set of messages sent from the cache controller to the directory is a bit simpler:

typedef enum {
    GETS = 1,
    GETM,
    PUTS,
    PUTM,
    PUTO,
    DDATA_FROM_OWNER,
    DDATA_FROM_NON_OWNER
} dmessages; // message to directory

The list of all possible states of a cache block is as follows:

typedef enum {
    MOSI_M = 1,  // M state
    MOSI_O,      // O state
    MOSI_S,      // S state
    MOSI_I,      // I state
    MOSI_ISD,    // ISD state: transition from I to S, waiting for data
    MOSI_IMAD,   // IMAD state: transition from I to M, waiting for data and ack
    MOSI_IMA,    // IMA state
    MOSI_SMAD,   // SMAD state
    MOSI_SMA,    // SMA state
    MOSI_MIA,    // MIA state
    MOSI_OMAC,   // OMAC state
    MOSI_OMA,    // OMA state
    MOSI_OIA,    // OIA state
    MOSI_SIA,    // SIA state
    MOSI_IIA     // IIA state
} state_t;

Functions:

We use the following functions; the purpose of each function is given in a line comment (//) after its declaration.

void MOSI_protocol_cache_request(messages, state_t);      // cache controller sends a request to the directory
void MOSI_protocol_directory_request(dmessages, state_t); // directory responds upon receiving the message
void I_state_cache(messages, state_t);        // what the cache needs to do when it is in I state
void Transition_I_to_SD(messages, state_t);   // transition from I to S, waiting for data D to complete
void Transition_I_to_MAD(messages, state_t);  // transition from I to M, waiting for data D and acknowledgement A
void Transition_I_to_MA(messages, state_t);   // transition from I to M, waiting for acknowledgement A
void S_state_cache(messages, state_t);        // cache in S state
void Transition_S_to_MAD(messages, state_t);  // transition from S to M, waiting for data D and acknowledgement A
void Transition_S_to_MA(messages, state_t);   // transition from S to M, waiting for acknowledgement A
void M_state_cache(messages, state_t);        // cache in M state
void Transition_M_to_IA(messages, state_t);   // transition from M to I, waiting for acknowledgement A
// Similar explanations apply to the rest of the functions.
void O_state_cache(messages, state_t);
void Transition_O_to_MAC(messages, state_t);
void Transition_O_to_MA(messages, state_t);
void Transition_O_to_IA(messages, state_t);
void Transition_S_to_IA(messages, state_t);
void Transition_I_to_IA(messages, state_t);
void X_state_cache(messages, state_t);
void directory_I(dmessages, state_t);
void directory_S(dmessages, state_t);
void directory_M(dmessages, state_t);
void directory_O(dmessages, state_t);
void send_GET_to_Directory(dmessages, state_t);
void send_Data_to_Requestor(messages, state_t);
void send_Data_to_Directory();
void checkOWNER(dmessages, state_t);
void checkGETS(state_t);
void checkGETM(dmessages, state_t);
void checkPUTS(dmessages, state_t);
void checkPUTM(dmessages, state_t);
void checkPUTO(dmessages, state_t);
void copy_Data_to_mem();

We also present an example implementation of a MOSI cache request in Figure 22. Note that, depending on the state the cache block is currently in, the controller acts on the message it receives and performs the corresponding actions.

Figure 22. Cache request implementation on cache controller

In addition, Figure 23 shows the actions of a cache in state I. It checks the incoming request; if the request is a LOAD, it sends a GetS message to the directory to request the state change. The directory then has to process that request from the requestor, and so on.

Figure 23. Cache controller implementation when the cache is in I state

The following figure shows how a transition has been implemented.

Figure 24. Cache controller implementation when the cache is in I state

The directory then executes some functions to process the request; Figure 25 shows how this was implemented.

Figure 25. Directory response implementation

In order to verify the results, we need to manually test the state of the cache. As a test, we send a request that changes a block from state I to state S; the output of the program should be as follows.

Figure 26. An example of debugging

As mentioned, we evaluate the system while varying different metrics and check the results. The key problem of the simulation is that it generates different results depending on the input data. We are still working out how to fix that problem.

How to run the code: This project has been developed in C++.

1 - Please download Microsoft Visual Studio to compile the project.

2 - The procedure for compiling the project is as follows:

a. Open Visual Studio -> select Open Project and go to the directory of our project

Figure 26. Microsoft Visual Studio interface for opening a project

b. Select CAProject.sln

c. If you want to run the MOSI protocol simulation, select Solution Explorer (in the right corner) -> Source Files -> right click -> Add -> Existing Item … -> navigate to MOSI_protocol.cpp.

You can follow similar steps to open the MSI protocol. Note that you can run only one of them at a time.

d. Press Ctrl + Shift + B to compile the project

e. Press Ctrl + F5 to run the project. Enter the input numbers to run the simulation.

Important: if you want to change the number of cores, you have to copy the following code once for each core that you want. This is inconvenient, but we have no solution for this right now.

// Core number
for (int j = 1; j < requestnum; j++){
    // odd indices: issue a LOAD for a randomly chosen state
    for (int i = 1; i < blocknum*cachenum; i += 2){
        state_t data2 = (state_t)(1 + rand() % 11);
        sta = translate_to_Readable_state(data2);
        state_t status = mapping_to_state(sta);
        MOSI_protocol_cache_request(LOAD, status);
    }
    // even indices: issue a STORE for a randomly chosen state
    for (int i = 2; i < blocknum*cachenum - 1; i += 2){
        state_t data2 = (state_t)(1 + rand() % 11);
        sta = translate_to_Readable_state(data2);
        state_t status = mapping_to_state(sta);
        MOSI_protocol_cache_request(STORE, status);
    }
}
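One possible way to avoid copying the block by hand is to wrap it in an outer loop over the cores. The sketch below is not part of our simulator: the function `issueRequests`, the `RequestFn` callback, and the extra `core` argument are assumptions for illustration. The callback stands in for `MOSI_protocol_cache_request`, and the inner loops keep the same bounds as the copied block.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>

enum Operation { LOAD, STORE };

// Hypothetical callback standing in for MOSI_protocol_cache_request.
using RequestFn = std::function<void(int core, Operation op, int state)>;

// Issues the same request pattern once per core and returns the number of requests.
long issueRequests(int corenum, int requestnum, int blocknum, int cachenum,
                   const RequestFn& request) {
    assert(corenum > 0);
    long issued = 0;
    for (int core = 0; core < corenum; ++core) {      // replaces the manual copies
        for (int j = 1; j < requestnum; ++j) {
            for (int i = 1; i < blocknum * cachenum; i += 2) {
                request(core, LOAD, 1 + std::rand() % 11);   // random state in 1..11
                ++issued;
            }
            for (int i = 2; i < blocknum * cachenum - 1; i += 2) {
                request(core, STORE, 1 + std::rand() % 11);
                ++issued;
            }
        }
    }
    return issued;
}
```

With this structure the number of cores becomes an ordinary input parameter, like the number of caches, blocks, and requests.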

Results & Analysis

1 - Number of write backs vs. cores:

We vary the number of cores and record the number of write backs made by MOSI and MSI. Because the added O state reduces the number of write backs to memory, MOSI achieves a significant improvement over the baseline MSI.

Figure 27. Checking the number of write backs when varying the number of cores

[Chart: Number of write backs vs. number of cores; series: MOSI, MSI; categories: 2, 4, 8, 16 cores; axis range: 0 to 2,500,000]

2 - Number of stalls vs. cores:

Adding more states means increasing the number of transitions between states. As mentioned, stalls happen at the moments when a transition is still in progress. However, as we can see, the increase in the number of stalls is small. This is a trade-off in selecting the protocol: we accept more stalls, but we reduce the number of write backs to memory.

Figure 28. Checking the number of stalls when varying the number of cores

3 - Number of cycles vs. cores:

We calculate the number of cycles because we want to find how many extra cycles are needed for adding one more state, the O state. The result below shows that MOSI spends more cycles than MSI. However, on closer consideration, we are assuming that all operations complete in 1 cycle, which is not the case in reality. For example, a write back should cost about 100 cycles, and the other operations about 3-5 cycles.

Figure 29. Checking the number of cycles when varying the number of cores
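A weighted cycle count along the lines suggested above could look as follows. This is only a sketch of the idea and was not used to produce our results; the names `CycleCosts` and `totalCycles` are hypothetical, the 100-cycle write back comes from the text, and the cost of 4 cycles for other operations is an assumed value inside the 3-5 range.

```cpp
#include <cassert>

// Hypothetical cost table for a more realistic cycle estimate.
struct CycleCosts {
    long writeBack = 100;  // cycles per write back (value from the text)
    long other = 4;        // cycles per other operation (assumed, within 3-5)
};

// Total cycles = write backs * write-back cost + other operations * other cost.
long totalCycles(long writeBacks, long otherOps, CycleCosts costs = CycleCosts{}) {
    assert(writeBacks >= 0 && otherOps >= 0);
    return writeBacks * costs.writeBack + otherOps * costs.other;
}
```

Under such a model, a protocol that saves write backs can come out ahead in total cycles even if it executes more operations overall.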

[Chart: Number of stalls vs. cores; series: MSI, MOSI; categories: 2, 4, 8, 16 cores; axis range: 0 to 15,000,000]

[Chart: Number of cycles vs. cores; series: MOSI, MSI; axis range: 0 to 30,000,000]

More importantly, we also count the number of hits and misses obtained with the MOSI and MSI protocols. The results show that MOSI also achieves better performance in terms of hits and misses. This reflects another benefit of adding the Owned state.



Summary:

+ Adding the O state costs more stalls; however, it reduces the number of write backs, which is more efficient for the system. The simulator's output depends on the input parameters, which leads to unpredictable results. The implementations of MOSI and MSI were developed in C++. To run MESI, you need to use Java to compile the code. The next section covers this information.

[Chart: Number of hits vs. cores; series: MOSI, MSI; axis range: 0 to 8,000,000]

[Chart: Number of misses vs. cores; series: MOSI, MSI; axis range: 0 to 8,000,000]

Adding Exclusive State (MESI protocols)

Motivation: In MESI, a core needs only one coherence transaction to read and then write a block, whereas in MSI reading and then writing takes two coherence transactions.

Mechanism: If a core issues a GetS for a block that is not shared by any other core, the requestor can obtain the block in state E. The core can then silently upgrade the block from state E to M without issuing another coherence request. As in the snooping MESI protocol, the question is whether E should be an ownership state. In this protocol the solution is simple: E is an ownership state, so the cache holding a block in E has to respond to requests for it. Consequently, the cache cannot evict the block silently; it has to send a PutE request to the directory so that the directory knows that it is now the owner of the block and must respond to incoming requests.
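The benefit of the E state can be illustrated with two small functions: one decides which state a GetS is granted, the other decides whether a later store needs a coherence request. The names `MesiState`, `stateGrantedOnGetS`, and `storeNeedsRequest` are assumptions for this sketch and do not come from our Java code.

```cpp
#include <cassert>

enum class MesiState { I, S, E, M };

// State granted to the requestor of a GetS, given how many other caches hold the block.
MesiState stateGrantedOnGetS(int otherSharers) {
    assert(otherSharers >= 0);
    return otherSharers == 0 ? MesiState::E : MesiState::S;
}

// Performs the local part of a store. Returns true if a coherence request (GetM)
// is still needed; returns false if the store completes silently.
bool storeNeedsRequest(MesiState& s) {
    if (s == MesiState::E || s == MesiState::M) {
        s = MesiState::M;  // silent upgrade from E, or already writable
        return false;
    }
    return true;           // from S or I, a GetM must be issued first
}
```

A read followed by a write of an unshared block therefore takes one transaction (the GetS), while the same sequence on a block granted in S takes two.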

If E is not an ownership state, the protocol complexity increases. For example, consider a system with three cores in which the first core holds a block in state E. When the directory receives a GetS or GetM from the second core, it only knows that the first core's block is in one of the states E, M, or I. If the block is in M, the directory must forward the request to the first core, which then responds to it. If it is in E, either the first core or the directory can respond, and if it is in I, only the directory can respond.

Diagrammatic representation of mechanism explained above

This protocol is somewhat more complex than the MSI protocol. The added complexity is at the directory controller: in addition to having more states, the directory controller must distinguish between more possible events. For example, when a PutS arrives, the directory must distinguish whether it is the last PutS or not; if it is the last PutS, the directory changes its state to I.
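Distinguishing the last PutS requires the directory to track which caches still share the block. A minimal sketch, assuming a set of sharer ids per directory entry (the names `SharedEntry` and `handlePutS` are hypothetical and not taken from our Java code):

```cpp
#include <cassert>
#include <set>

enum class EntryState { I, S };

// Hypothetical directory entry that remembers the sharers of one block.
struct SharedEntry {
    EntryState state = EntryState::I;
    std::set<int> sharers;  // ids of the cores holding the block in S
};

// Handles a PutS from `core`. Returns true if this was the last PutS,
// in which case the entry moves to state I.
bool handlePutS(SharedEntry& entry, int core) {
    assert(entry.state == EntryState::S);
    entry.sharers.erase(core);
    if (entry.sharers.empty()) {
        entry.state = EntryState::I;  // no sharers left
        return true;
    }
    return false;                     // other sharers remain: stay in S
}
```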

Implementation of MESI_DIRECTORY

The main point of the implementation is that for each block in memory there is a corresponding directory entry. The directory has to respond to every request, and it maintains the state of each block. Every access checks with the directory whether the block is in state I, E, M, or S. The cache sends each request to the directory, and the directory, as the central point, responds based on the current state of the block. I implemented all the states with GetS and GetM in the directory controller and with load and store in the cache controller.

Simulation of MESI_DIRECTORY

I implemented the MESI directory protocol in Java. The parameter is the number of cores, and the hits and misses are calculated based on the number of cores. If the requestor gets the data back, the hit count increases; otherwise, meaning the block's state does not allow a response for the requested state, it counts as a miss. We increment the cycle count every time the loop executes, so that we know how many times the loop iterates. The MESI_CACHE method in the code deals entirely with the cache controller, and MESI_DIRECTORY deals entirely with the directory controller.

Source code:

o Download and install the Java SE Development Kit from https://www.oracle.com/java/index.html

o Download and install the Eclipse IDE for Java Developers from https://eclipse.org/

o Open Eclipse and import the provided code into Eclipse.

o Complete the requested parts of the provided code, which includes ADVANCED COMPUTER ARCHITECTURE, IMPLEMENTATION OF MESI_DIRECTORY PROTOCOL.

o Download MESI_DIRECTORY_PROTOCOL_SOURCE_CODE. In that folder, open the src folder, where you will find the MESI_DIRECTORY3 file. Execute it using Java 6 or above. The parameter is the number of cores (a number); the output is the total number of hits, misses, and cycles.

o To compile and run MESI_DIRECTORY3.java you can use the Eclipse IDE.

o In the code file you will find MESI_DIRECTORY_CAHE and MESI_DIRECTORY, which contain the procedures for each state. For example, in MESI_DIRECTORY the I state procedure is I_state_directory(request), and in MESI_CACHE it is I_state_cache(request).

Step-by-step procedure to write and run Java code in Eclipse:

o Open Eclipse.

o From the top menu click File ---> New ---> Java Project.

o Choose a name for your project and click on the Finish button.

o Add the MESI_DIRECTORY3.java file.

o Right click on the class and select Run As Java Application.

