7/28/2019 Embedded Memories based on SOC (VLSI Seminar)
Embedded Memories
Dept of ECE, RVCE Bangalore. Page 1
Introduction
Embedded systems have three functionality aspects:
Processing: processors, for transformation of data
Storage: memory, for retention of data
Communication: buses, for transfer of data
Memory: Basic Concept
Stores a large number of bits
m x n: m words of n bits each
k = log2(m) address input signals, or m = 2^k words
e.g., 4,096 x 8 memory:
32,768 bits
12 address input signals
8 input/output data signals
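The sizing relations above can be checked directly; a short Python sketch for the 4,096 x 8 example:

```python
# Memory organization arithmetic for the 4,096 x 8 example above.
import math

m, n = 4096, 8          # m words of n bits each
total_bits = m * n      # total storage capacity in bits
k = int(math.log2(m))   # address input signals needed (k = log2(m))

print(total_bits)  # 32768
print(k)           # 12
```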
Memory access
r/w: selects read or write
enable: read or write occurs only when asserted
multiport: multiple simultaneous accesses to different locations
1. Write Ability and Storage Permanence
Traditional ROM/RAM distinctions:
ROM: read only; bits stored without power
RAM: read and write; loses stored bits without power
Traditional distinctions blurred:
Advanced ROMs can be written to, e.g., EEPROM
Advanced RAMs can hold bits without power, e.g., NVRAM
Write ability: the manner and speed with which a memory can be written
Storage permanence: the ability of a memory to hold stored bits after they are written
Write ability
Ranges of write ability:
High end: processor writes to memory simply and quickly, e.g., RAM
Middle range: processor writes to memory, but more slowly, e.g., FLASH, EEPROM
Lower range: special equipment (a programmer) must be used to write to memory, e.g., EPROM, OTP ROM
Low end: bits stored only during fabrication, e.g., mask-programmed ROM
In-system programmable memory:
Can be written to by a processor in the embedded system using the memory
Memories in the high end and middle range of write ability
Storage permanence
Ranges of storage permanence:
High end: essentially never loses bits, e.g., mask-programmed ROM
Middle range: holds bits for days, months, or years after the memory's power source is turned off, e.g., NVRAM
Lower range: holds bits as long as power is supplied to the memory, e.g., SRAM
Low end: begins to lose bits almost immediately after they are written, e.g., DRAM
2. ROM: Read-Only Memory
Nonvolatile memory:
Holds bits after power is no longer supplied
High end and middle range of storage permanence
Can be read from, but not written to, by a processor in an embedded system
Traditionally written to (programmed) before being inserted into the embedded system
Uses:
Store a software program for a general-purpose processor; program instructions can occupy one or more ROM words
Store constant data needed by the system
Implement a combinational circuit
Example: 8 x 4 ROM
Horizontal lines = words
Vertical lines = data
Lines connected only at circles
The decoder sets word 2's line to 1 if the address input is 010
Data lines Q3 and Q1 are set to 1 because there is a programmed connection with word 2's line
Word 2 is not connected with data lines Q2 and Q0
Output is 1010
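The decode walk-through above can be modeled as a simple lookup table; a minimal sketch (only word 2's contents come from the example, the other words are assumed blank):

```python
# Model of the 8 x 4 ROM example: 8 words, 4 bits each.
# Only word 2 is programmed here (connections on Q3 and Q1),
# matching the walk-through above; other words are assumptions.
rom = [0b0000] * 8
rom[2] = 0b1010  # programmed connections at Q3 and Q1 only

def read(address):
    # The decoder selects one word line; the data lines Q3..Q0
    # reflect the programmed connections of that word.
    return format(rom[address], "04b")

print(read(0b010))  # "1010"
```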
Mask-programmed ROM
Connections programmed at fabrication by a set of masks
Lowest write ability: programmed only once
Highest storage permanence: bits never change unless damaged
Typically used for the final design of high-volume systems: spreads out NRE cost for a low unit cost
OTP ROM: One-Time Programmable ROM
Connections programmed after manufacture by the user:
user provides a file of the desired contents of the ROM
file is input to a machine called a ROM programmer
each programmable connection is a fuse
ROM programmer blows fuses where connections should not exist
Very low write ability: typically written only once; requires a ROM programmer device
Very high storage permanence: bits don't change unless reconnected to the programmer and more fuses blown
Commonly used in final products: cheaper, and harder to inadvertently modify
EPROM: Erasable Programmable ROM
Programmable component is a MOS transistor with a floating gate surrounded by an insulator:
(a) Negative charges form a channel between source and drain, storing a logic 1
(b) A large positive voltage at the gate causes negative charges to move out of the channel and become trapped in the floating gate, storing a logic 0
(c) (Erase) Shining UV rays on the surface of the floating gate causes the negative charges to return to the channel from the floating gate, restoring the logic 1
(d) An EPROM package has a quartz window through which the UV light can pass
Better write ability: can be erased and reprogrammed thousands of times
Reduced storage permanence: program lasts about 10 years, but is susceptible to radiation and electrical noise
Typically used during design development
EEPROM: Electrically Erasable Programmable ROM
Programmed and erased electronically, typically using a higher-than-normal voltage; can program and erase individual words
Better write ability:
can be in-system programmable, with a built-in circuit to provide the higher-than-normal voltage
a built-in memory controller is commonly used to hide the details from the memory user
writes are very slow due to erasing and programming; a busy pin indicates to the processor that the EEPROM is still writing
can be erased and programmed tens of thousands of times
Similar storage permanence to EPROM (about 10 years)
Far more convenient than EPROM, but more expensive
Flash Memory
Extension of EEPROM:
same floating-gate principle
same write ability and storage permanence
Fast erase: large blocks of memory erased at once, rather than one word at a time; blocks are typically several thousand bytes large
Writes to single words may be slower: the entire block must be read, the word updated, then the entire block written back
Used in embedded systems storing large data items in nonvolatile memory, e.g., digital cameras, TV set-top boxes, cell phones
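The read-update-erase-rewrite sequence described above can be sketched as follows; the block size and erased value are illustrative assumptions, not taken from these notes:

```python
# Sketch of the word-update sequence described above: to change one
# word, the whole block is read, modified in a buffer, erased, and
# rewritten. BLOCK_WORDS and ERASED are illustrative assumptions.
BLOCK_WORDS = 1024
ERASED = 0xFF  # flash bits erase to all 1s

flash_block = [ERASED] * BLOCK_WORDS

def write_word(block, offset, value):
    buffer = list(block)          # 1. read the entire block
    buffer[offset] = value        # 2. update the single word
    for i in range(len(block)):   # 3. erase the whole block at once
        block[i] = ERASED
    block[:] = buffer             # 4. write the entire block back

write_word(flash_block, 5, 0x3C)
print(flash_block[5])   # 60
```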
3. RAM
Typically volatile memory: bits are not held without a power supply
Read and written easily by the embedded system during execution
Internal structure more complex than ROM:
a word consists of several memory cells, each storing 1 bit
each input and output data line connects to each cell in its column
rd/wr connected to every cell
when a row is enabled by the decoder, each cell's logic stores the input data bit when rd/wr indicates write, or outputs the stored bit when rd/wr indicates read
Basic types of RAM
SRAM: Static RAM
memory cell uses a flip-flop to store the bit
requires 6 transistors
holds data as long as power is supplied
DRAM: Dynamic RAM
memory cell uses a MOS transistor and a capacitor to store the bit
more compact than SRAM
refresh required because the capacitor leaks; a word's cells are refreshed when read
typical refresh rate: 15.625 microsec.
slower to access than SRAM
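The 15.625 microsecond figure quoted above is consistent with a commonly assumed DRAM retention spec; a quick check (the 64 ms retention window and 4096-row array are assumptions, not stated in these notes):

```python
# If every row must be refreshed within 64 ms (assumed retention
# spec) and a 4096-row array spreads those refreshes evenly, one
# row is refreshed every 15.625 microseconds.
retention_ms = 64
rows = 4096
interval_us = retention_ms * 1000 / rows
print(interval_us)  # 15.625
```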
RAM variations
PSRAM: Pseudo-static RAM
DRAM with a built-in memory refresh controller
popular low-cost, high-density alternative to SRAM
NVRAM: Nonvolatile RAM
holds data after external power is removed
Battery-backed RAM:
SRAM with its own permanently connected battery
writes are as fast as reads
no limit on the number of writes, unlike nonvolatile ROM-based memory
SRAM with EEPROM or flash:
stores the complete RAM contents in EEPROM or flash before power is turned off
4. Scratchpad Memory
Embedded processor-based system:
> Processor core
> Embedded memory
> Instruction and data cache
> Embedded SRAM
> Embedded DRAM
Scratch-pad memory design problems:
1. How much on-chip memory?
2. How to partition on-chip memory between cache and scratchpad?
3. Which variables/arrays go in the scratchpad?
Goals:
> Improve performance
> Save power
Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications
Abstract
Efficient utilization of on-chip memory space is extremely important in modern embedded system applications based on microprocessor cores. In addition to a data cache that interfaces with slower off-chip memory, a fast on-chip SRAM, called Scratch-Pad memory, is often used in several applications. This paper presents a technique for efficiently exploiting on-chip Scratch-Pad memory by partitioning the application's scalar and array variables into off-chip DRAM and on-chip Scratch-Pad SRAM, with the goal of minimizing the total execution time of embedded applications.
> Introduction
Complex embedded system applications typically use heterogeneous chips consisting of microprocessor cores, along with on-chip memory and co-processors. Flexibility and short design time considerations drive the use of CPU cores as instantiable modules in system designs [5]. The integration of processor cores and memory in the same chip effects a reduction in the chip count, leading to cost-effective solutions. Examples of commercial microprocessor cores commonly used in system design are LSI Logic's CW33000 series [3] and the ARM series from Advanced RISC Machines [10].
Typical examples of optional modules integrated with the processor on the same chip are: instruction cache, data cache, and on-chip SRAM. The instruction and data caches are fast local memory serving as an interface between the processor and the off-chip memory. The on-chip SRAM, termed Scratch-Pad memory, is a small, high-speed data memory that is mapped into an address space disjoint from the off-chip memory, but connected to the same address and data buses.
Both the cache and the Scratch-Pad SRAM have a single-processor-cycle access latency, whereas an access to the off-chip memory (usually DRAM) takes several (typically 10-20) processor cycles.
The main difference between the Scratch-Pad SRAM and the data cache is that the SRAM guarantees a single-cycle access time, whereas an access to the cache is subject to compulsory, capacity, and conflict misses.
When an embedded application is compiled, the accessed data can now be stored either in the Scratch-Pad memory or in off-chip memory. In the second case, it is accessed by the processor through the data cache. We present a technique for minimizing the total execution time of an embedded application by a careful partitioning of the scalar and array variables used in the application into off-chip DRAM (accessed through the data cache) and Scratch-Pad SRAM.
Optimization techniques for improving the data cache performance of programs have been reported [4, 7, 9]. The analysis in [9] is limited to scalars, and hence, not generally applicable. Iteration space blocking for improving data locality is studied in [4]. This technique is also
limited to the type of code that yields naturally to blocking. In [7], a data layout strategy for avoiding conflict misses is presented. However, array access patterns in some applications are too complex to be statically analyzable using this method. The availability of an on-chip SRAM with guaranteed fast access time creates an opportunity for overcoming some of the cache conflict problems (Section 2). The problem of partitioning data into SRAM and cache with the objective of maximizing performance, which we address in this paper, has, to our knowledge, not been attempted before.
> Problem Description
Figure 1(a) shows the architectural block diagram of an application employing a typical embedded core processor (e.g., the LSI Logic CW33000 RISC microprocessor core [3]), where the parts enclosed in the dotted rectangle are implemented in one chip, and which interfaces with an off-chip memory, usually realized with DRAM. The address and data buses from the CPU core connect to the Data Cache, Scratch-Pad memory, and External Memory Interface (EMI) blocks. On a memory access request from the CPU, the data cache indicates a cache hit to the EMI block through the C_HIT signal. Similarly, if the SRAM interface circuitry in the Scratch-Pad memory determines that the referenced memory address maps into the on-chip SRAM, it assumes control of the data bus and indicates this status to the EMI through the S_HIT signal. If both the cache and SRAM report misses, the EMI transfers a block of data of the appropriate size (equal to the cache line size) between the cache and the DRAM.
The data address space mapping is shown in Figure 1(b). The lowest range of memory addresses maps into the Scratch-Pad memory and has a single-processor-cycle access time. Thus, in Figure 1(a), S_HIT would be asserted whenever the processor attempts to access any address in that range. The remaining addresses map into the off-chip DRAM, and are accessed by the CPU through the data cache. A cache hit for an address in this range results in a single-cycle delay, whereas a cache miss, which leads to a block transfer between off-chip memory and the cache, results in a delay of 10-20 processor cycles.
Suppose the above code is executed on a processor configured with a data cache of size 1 KByte. The performance is degraded by the conflict misses in the cache between elements of the two arrays Hist and BrightnessLevel. Data layout techniques, such as [7], are not effective in eliminating the above type of conflicts, because the accesses to Hist are data-dependent. Note that this problem occurs in both direct-mapped and set-associative caches.
However, the conflict problem can be solved elegantly if we include a Scratch-Pad SRAM in the architecture. Since the Hist array is relatively small, we can store it in the SRAM, so that it does not conflict with BrightnessLevel in the data cache. This storage assignment improves the performance of the Histogram Evaluation code significantly.
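The Histogram Evaluation code itself is not reproduced in these notes; the sketch below assumes a typical histogram kernel, and shows why accesses to Hist are data-dependent and therefore not statically analyzable for cache layout:

```python
# Assumed histogram kernel: Hist is indexed by values read from
# BrightnessLevel, so its access pattern depends on the image data
# and cannot be predicted at compile time. The pixel data here is
# purely illustrative.
BrightnessLevel = [12, 255, 12, 0, 128, 12]
Hist = [0] * 256

for level in BrightnessLevel:
    Hist[level] += 1      # data-dependent index into Hist

print(Hist[12])  # 3
```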
We present a strategy for partitioning the scalar and array variables in an application code into Scratch-Pad memory and off-chip DRAM accessed through the data cache, to maximize performance by selectively mapping to the SRAM those variables that are estimated to cause the maximum number of conflicts in the data cache.
> The Partitioning Strategy
The overall approach in partitioning program variables into Scratch-Pad memory and DRAM is to minimize the cross-interference between different variables in the data cache. We first outline the different features of the code affecting the partitioning.
5. Cache
Want inexpensive, fast memory
Main memory: large, inexpensive, slow memory; stores the entire program and data
Cache: small, expensive, fast memory; stores a copy of likely-accessed parts of the larger memory
Can be multiple levels of cache
> Introduction to Memory Hierarchy
Usually designed with SRAM: faster but more expensive than DRAM
Usually on the same chip as the processor:
space limited, so much smaller than off-chip main memory
faster access (1 cycle vs. several cycles for main memory)
Cache operation:
Request for main memory access (read or write)
First, check the cache for a copy
cache hit: copy is in the cache; quick access
cache miss: copy not in the cache; read the address and possibly its neighbors into the cache
Several cache design choices: cache mapping, replacement policies, and write techniques
> Different Mapping Techniques
Direct mapping
Main memory address divided into two fields:
Index: the cache address; number of bits determined by cache size
Tag: compared with the tag stored in the cache at the address indicated by the index; if the tags match, check the valid bit
Valid bit: indicates whether the data in the slot has been loaded from memory
Offset: used to find the particular word in the cache line
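The tag/index/offset split described above can be sketched for an assumed configuration (a 1 KByte direct-mapped cache with 16-byte lines and 32-bit byte addresses; these sizes are illustrative, not from the notes):

```python
# Splitting a main-memory address into tag / index / offset fields
# for a direct-mapped cache. Sizes are illustrative assumptions:
# 1 KByte cache, 16-byte lines.
CACHE_BYTES, LINE_BYTES = 1024, 16
NUM_LINES = CACHE_BYTES // LINE_BYTES        # 64 lines
OFFSET_BITS = LINE_BYTES.bit_length() - 1    # 4 bits of offset
INDEX_BITS = NUM_LINES.bit_length() - 1      # 6 bits of index

def split(addr):
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split(0x12345))  # (72, 52, 5)
```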
Fully associative mapping
Complete main memory address stored in each cache address
All addresses stored in the cache are simultaneously compared with the desired address
Valid bit and offset same as in direct mapping
Set-associative mapping
Compromise between direct mapping and fully associative mapping
Index same as in direct mapping
But each cache address contains the content and tags of 2 or more memory address locations
Tags of that set simultaneously compared, as in fully associative mapping
Cache with set size N called N-way set-associative; 2-way, 4-way, and 8-way are common
Technique for choosing which block to replace:
when a fully associative cache is full
when a set-associative cache's set is full
direct-mapped cache has no choice
Random: replace a block chosen at random
LRU: least recently used
replace the block not accessed for the longest time
FIFO: first-in first-out
push block onto a queue when accessed
choose the block to replace by popping the queue
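The LRU policy above can be sketched for a single set of an associative cache; the set size and the reference string are illustrative assumptions:

```python
# Minimal LRU replacement model for one set of an associative cache.
from collections import OrderedDict

def lru_misses(references, ways):
    cache = OrderedDict()
    misses = 0
    for block in references:
        if block in cache:
            cache.move_to_end(block)      # mark as most recently used
        else:
            misses += 1
            if len(cache) == ways:
                cache.popitem(last=False) # evict least recently used
            cache[block] = True
    return misses

print(lru_misses([1, 2, 3, 1, 4, 1, 2], ways=2))  # 6
```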
> Cache Write Techniques
When written, the data cache must update main memory
Write-through:
write to main memory whenever the cache is written to
easiest to implement
processor must wait for the slower main memory write
potential for unnecessary writes
Write-back:
main memory written only when a dirty block is replaced
extra dirty bit for each block, set when the cache block is written to
reduces the number of slow main memory writes
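The write-back behavior above can be sketched with a single cache line; the class and its names are illustrative, not from the notes:

```python
# Sketch of write-back: main memory is written only when a dirty
# block is evicted, not on every cache write.
class WriteBackLine:
    def __init__(self):
        self.tag, self.data, self.dirty = None, None, False
        self.memory_writes = 0

    def write(self, tag, data):
        if self.tag != tag and self.dirty:
            self.memory_writes += 1   # flush the dirty block on eviction
        self.tag, self.data, self.dirty = tag, data, True

line = WriteBackLine()
line.write(0xA, 1)   # first write: only the cache is updated
line.write(0xA, 2)   # write hit: still only the cache is updated
line.write(0xB, 3)   # different block: eviction flushes the dirty line
print(line.memory_writes)  # 1
```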
> Cache Impact on System Performance
Most important parameters in terms of performance:
total size of cache: total number of data bytes the cache can hold (tag, valid, and other housekeeping bits not included in the total)
degree of associativity
data block size
Larger caches achieve lower miss rates but higher access cost
e.g.:
2 KByte cache: miss rate = 15%, hit cost = 2 cycles, miss cost = 20 cycles
avg. cost of memory access = (0.85 * 2) + (0.15 * 20) = 4.7 cycles
4 KByte cache: miss rate = 6.5%, hit cost = 3 cycles, miss cost unchanged
avg. cost of memory access = (0.935 * 3) + (0.065 * 20) = 4.105 cycles (improvement)
8 KByte cache: miss rate = 5.565%, hit cost = 4 cycles, miss cost unchanged
avg. cost of memory access = (0.94435 * 4) + (0.05565 * 20) = 4.8904 cycles (worse)
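The three averages above can be recomputed directly from the stated miss rates and hit costs:

```python
# avg = hit_rate * hit_cost + miss_rate * miss_cost,
# with the 20-cycle miss cost held fixed across all three caches.
def avg_cost(miss_rate, hit_cost, miss_cost=20):
    return (1 - miss_rate) * hit_cost + miss_rate * miss_cost

print(round(avg_cost(0.15, 2), 4))      # 4.7     (2 KByte)
print(round(avg_cost(0.065, 3), 4))     # 4.105   (4 KByte)
print(round(avg_cost(0.05565, 4), 4))   # 4.8904  (8 KByte)
```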
6. Advanced RAM
DRAMs commonly used as main memory in processor-based embedded systems: high capacity, low cost
Many variations of DRAM proposed, needed to keep pace with processor speeds:
FPM DRAM: fast page mode DRAM
EDO DRAM: extended data out DRAM
SDRAM/ESDRAM: synchronous and enhanced synchronous DRAM
RDRAM: Rambus DRAM
6.1 Basic DRAM
Address bus multiplexed between row and column components
Row and column addresses are latched in, sequentially, by strobing the ras (row address strobe) and cas (column address strobe) signals, respectively
Refresh circuitry can be external or internal to the DRAM device:
strobes consecutive memory addresses periodically, causing the memory content to be refreshed
refresh circuitry disabled during read or write operations
Fast Page Mode DRAM (FPM DRAM)
Each row of memory bit array is viewed as a page
A page contains multiple words
Individual words addressed by column address
Timing diagram:
row (page) address sent
3 words read consecutively by sending a column address for each
extra cycle eliminated on each read/write of words from the same page
Extended Data Out DRAM (EDO DRAM)
Improvement of FPM DRAM
Extra latch before the output buffer allows strobing of cas before the data read operation completes; reduces read/write latency by one additional cycle
Synchronous and Enhanced Synchronous (ES) DRAM
SDRAM latches data on the active edge of the clock
Eliminates time to detect the ras/cas and rd/wr signals
A counter is initialized to the column address, then incremented on the active edge of the clock to access consecutive memory locations
ESDRAM improves on SDRAM:
added buffers enable overlapping of column addressing
faster clocking and lower read/write latency possible
Rambus DRAM (RDRAM)
More of a bus interface architecture than a DRAM architecture
Data latched on both rising and falling edges of the clock
Broken into 4 banks, each with its own row decoder: can have 4 pages open at a time
Capable of very high throughput
6.2 DRAM Integration Problem
SRAM easily integrated on the same chip as the processor; DRAM more difficult:
different chip-making processes for DRAM and conventional logic
goal of conventional logic (IC) designers: minimize parasitic capacitance to reduce signal propagation delays and power consumption
goal of DRAM designers: create capacitor cells to retain the stored information
Integration processes beginning to appear
6.3 Memory Management Unit (MMU)
Duties of the MMU:
handles DRAM refresh, bus interface, and arbitration
takes care of memory sharing among multiple processors
translates logical memory addresses from the processor into physical memory addresses of the DRAM
Modern CPUs often come with the MMU built in; single-purpose processors can also be used
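The address-translation duty above can be sketched as a simple page-table lookup; the page size and table contents are illustrative assumptions:

```python
# Sketch of logical-to-physical address translation: a page table
# maps logical page numbers to physical frames. PAGE_SIZE and the
# table contents are illustrative assumptions.
PAGE_SIZE = 4096
page_table = {0: 5, 1: 9}   # logical page -> physical frame

def translate(logical_addr):
    page, offset = divmod(logical_addr, PAGE_SIZE)
    return page_table[page] * PAGE_SIZE + offset

print(hex(translate(0x1004)))  # 0x9004 (page 1 maps to frame 9)
```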
7. Cache Coherence Protocols
The presence of caches in current-generation distributed shared-memory multiprocessors improves performance by reducing the processor's memory access time and by decreasing the bandwidth requirements of both the local memory module and the global interconnect. Unfortunately, the local caching of data introduces the cache coherence problem. Early distributed shared-memory machines left it to the programmer to deal with the cache coherence problem, and consequently these machines were considered difficult to program [5][38][54]. Today's multiprocessors solve the cache coherence problem in hardware by implementing a cache coherence protocol. This chapter outlines the cache coherence problem and describes how cache coherence protocols solve it.
In addition, this chapter discusses several different varieties of cache coherence protocols, including their advantages and disadvantages, their organization, their common protocol transitions, and some examples of machines that implement each protocol. Ultimately a designer has to choose a protocol to implement, and this should be done carefully. Protocol choice can lead to differences in cache miss latencies and differences in the number of messages sent through the interconnection network, both of which can lead to differences in overall application performance. Moreover, some protocols have high-level properties like automatic data distribution or distributed queueing that can help application performance. Before discussing specific protocols, however, let us examine the cache coherence problem in distributed shared-memory machines in detail.
7.1 The Cache Coherence Problem
Figure 2.1 depicts an example of the cache coherence problem. Memory initially contains the value 0 for location x, and processors 0 and 1 both read location x into their caches. If processor 0 writes location x in its cache with the value 1, then processor 1's cache now contains the stale value 0 for location x. Subsequent reads of location x by processor 1 will continue to return the stale, cached value of 0. This is likely not what the programmer expected when she wrote the program. The expected behavior is for a read by any processor to return the most up-to-date copy of the datum. This is exactly what a cache coherence protocol does: it ensures that requests for a certain datum always return the most recent value.
The coherence protocol achieves this goal by taking action whenever a location is written. More precisely, since the granularity of a cache coherence protocol is a cache line, the protocol takes action whenever any cache line is written. Protocols can take two kinds of actions when a cache line L is written: they may either invalidate all copies of L from the other caches in the machine, or they may update those lines with the new value being written.
Continuing the earlier example, in an invalidation-based protocol, when processor 0 writes x = 1, the line containing x is invalidated from processor 1's cache. The next time processor 1 reads location x it suffers a cache miss, and goes to memory to retrieve the latest copy of the cache line. In systems with write-through caches, memory can supply the data because it was updated when processor 0 wrote x. In the more common case of systems with writeback caches, the cache coherence protocol has to ensure that processor 1 asks processor 0 for the latest copy of the cache line. Processor 0 then supplies the line from its cache and processor 1 places that line into its cache, completing its cache miss. In update-based protocols, when processor 0 writes x = 1, it sends the new copy of the datum directly to processor 1 and updates the line in processor 1's cache with the new value. In either case, subsequent reads by processor 1 now see the correct value of 1 for location x, and the system is said to be cache coherent.
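The invalidation walk-through above can be sketched as a toy model with two private caches; write-through is assumed for simplicity, matching the first case the text describes:

```python
# Toy invalidation-based coherence model: two private caches over a
# shared memory location x. Write-through is assumed so that memory
# always holds the latest value.
memory = {"x": 0}
caches = [dict(), dict()]          # one private cache per processor

def read(p, addr):
    if addr not in caches[p]:                 # miss: fetch latest copy
        caches[p][addr] = memory[addr]
    return caches[p][addr]

def write(p, addr, value):
    for q, cache in enumerate(caches):        # invalidate other copies
        if q != p:
            cache.pop(addr, None)
    caches[p][addr] = value
    memory[addr] = value                      # write-through to memory

read(0, "x"); read(1, "x")   # both caches now hold x = 0
write(0, "x", 1)             # invalidates x in processor 1's cache
print(read(1, "x"))          # 1, not the stale 0
```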
Most modern cache-coherent multiprocessors use the invalidation technique rather than
the update technique since it is easier to implement in hardware. As cache line sizes continue
to increase, the invalidation-based protocols remain popular because of the increased
number of updates required when writing a cache line sequentially with an update-based coherence protocol. There are times, however, when using an update-based protocol is superior. These include accessing heavily contended lines and some types of synchronization variables. Typically designers choose an invalidation-based protocol and add some special features to handle heavily contended synchronization variables. All the protocols presented in this paper are invalidation-based cache coherence protocols, and a later section is devoted to the discussion of synchronization primitives.
8. Directory-Based Coherence
The previous section describes the cache coherence problem and introduces the cache coherence protocol as the agent that solves the coherence problem. But the question remains: how do cache coherence protocols work?
There are two main classes of cache coherence protocols: snoopy protocols and directory-based protocols. Snoopy protocols require the use of a broadcast medium in the machine and hence apply only to small-scale bus-based multiprocessors. In these broadcast systems, each cache snoops on the bus and watches for transactions which affect it. Any time a cache sees a write on the bus, it invalidates that line out of its cache if it is present. Any time a cache sees a read request on the bus, it checks to see if it has the most recent copy of the data, and if so, responds to the bus request. These snoopy bus-based systems are easy to build, but unfortunately as the number of processors on the bus increases, the single shared bus becomes a bandwidth bottleneck and the snoopy protocols' reliance on a broadcast mechanism becomes a severe scalability limitation.
To address these problems, architects have adopted the distributed shared memory (DSM) architecture. In a DSM multiprocessor, each node contains the processor and its caches, a portion of the machine's physically distributed main memory, and a node controller which manages communication within and between nodes (see Figure 2.2). Rather than being connected by a single shared bus, the nodes are connected by a scalable interconnection network. The DSM architecture allows multiprocessors to scale to thousands
of nodes, but the lack of a broadcast medium creates a problem for the cache coherence protocol. Snoopy protocols are no longer appropriate, so instead designers must use a directory-based cache coherence protocol.
The first description of directory-based protocols appears in Censier and Feautrier's 1978 paper [9]. The directory is simply an auxiliary data structure that tracks the caching state of each cache line in the system. For each cache line in the system, the directory needs to track which caches, if any, have read-only copies of the line, or which cache has the latest copy of the line if the line is held exclusively. A directory-based cache-coherent machine works by consulting the directory on each cache miss and taking the appropriate action based on the type of request and the current state of the directory.
Figure 2.3 shows a directory-based DSM machine. Just as main memory is physically distributed throughout the machine to improve aggregate memory bandwidth, so the directory is distributed to eliminate the bottleneck that would be caused by a single monolithic directory. If each node's main memory is divided into cache-line-sized blocks, then the directory can be thought of as extra bits of state for each block of main memory. Any time
a processor wants to read cache line L, it must send a request to the node that has the directory for line L. This node is called the home node for L. The home node receives the request, consults the directory, and takes the appropriate action. On a cache read miss, for example, if the directory shows that the line is currently uncached or is cached read-only
9. MESI Cache Coherence
Abstract
Nowadays, computational systems (multiprocessor and uniprocessor) need to avoid the cache coherence problem. There are several techniques to solve this problem; the MESI cache coherence protocol is one of them. This paper presents a simulator of the MESI protocol, which is used for teaching cache memory coherence on computer systems with a hierarchical memory system and for explaining the process of cache memory location in multilevel cache memory systems. The paper gives a description of the course in which the simulator is used, a short explanation of the MESI protocol, and how the simulator works. Then, some experimental results in a real teaching environment are described.
Keywords: Cache memory, Coherence protocol, MESI, Simulator, Teaching tool.
9.1 Introduction
In multiprocessor systems, the memory should provide a set of locations that hold values, and when a location is read it should return the latest value written to that location. This property must be established to communicate data between threads or processes: a read returns the latest value written to the location regardless of which process wrote it. This issue is known as the cache coherence problem. Problems of this kind arise even in uniprocessors when I/O operations occur. Most I/O transfers are performed by direct memory access (DMA) devices that move data between the memory and the peripheral component without involving the processor [5]. When the DMA device writes to a location in main memory, unless special action is taken, the processor may continue to see the old value if that location was previously present in its cache [1]. The techniques and support used to solve the multiprocessor cache coherence problem also solve the I/O coherence problem. Essentially all microprocessors today provide support for multiprocessor
cache coherence. The MESI cache coherence protocol is a technique to maintain the coherence of the cache memory contents in hierarchical memory systems [2], [7]. It is based on four possible states of the cache blocks: Modified, Exclusive, Shared and Invalid. Each accessed block lies in one of these states, and the transitions among them define the MESI protocol. Nowadays, most processors (Intel, AMD) use this protocol or variants of it, so knowing how these processors maintain cache coherence is very important for students. This paper presents a simulator of the MESI cache coherence protocol [1], [6]. The MESI simulator is a software tool implemented in the Java language.
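The stale-value scenario behind the I/O coherence problem can be shown in a toy model: a cache that is never invalidated when a DMA device writes to main memory behind its back. All names here are illustrative; this is not how real hardware is structured.

```python
# Toy model of the I/O coherence problem: the CPU keeps serving a cached
# value after a DMA device has updated main memory directly.

main_memory = {0x100: "old"}
cpu_cache = {}

def cpu_read(addr):
    # Read through the cache, filling it on a miss.
    if addr not in cpu_cache:
        cpu_cache[addr] = main_memory[addr]
    return cpu_cache[addr]

def dma_write(addr, value):
    # The DMA device writes memory without notifying the cache.
    main_memory[addr] = value

print(cpu_read(0x100))   # "old" -- the line is now cached
dma_write(0x100, "new")
print(cpu_read(0x100))   # still "old": the cache holds a stale copy
```

A coherence mechanism would invalidate or update the cached line on the DMA write, which is exactly what the MESI machinery described below provides.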
10. MESI Protocol
The MESI protocol makes it possible to maintain coherence in cached systems. It is based on the four states that a block in the cache memory can have; these four states give the protocol its name: Modified, Exclusive, Shared and Invalid. The states are explained below:
Invalid: a non-valid state. The data being looked for are not in the cache, or the local copy of those data is not correct because another processor has updated the corresponding memory position.

Shared: shared without having been modified. Another processor may also have the data in its cache memory, and both copies are in their current version.

Exclusive: exclusive without having been modified. That is, this cache is the only one that has the block, and its contents agree with those in main memory.

Modified: in fact an exclusive-modified state. It means that the cache has the only copy that is correct in the whole system; the data in main memory are stale.
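The four states above boil down to three boolean properties of a cache line. The table below is a sketch that restates the descriptions; the property names are chosen for the example.

```python
# The four MESI states summarized as (valid, dirty, may_be_shared),
# following directly from the descriptions above.

MESI = {
    "M": (True,  True,  False),   # only correct copy; memory is stale
    "E": (True,  False, False),   # only cached copy; matches memory
    "S": (True,  False, True),    # other caches may hold the same copy
    "I": (False, False, False),   # no usable copy in this cache
}

def must_write_back_on_eviction(state):
    """A line needs a write-back only if it is valid and dirty."""
    valid, dirty, _ = MESI[state]
    return valid and dirty

print(must_write_back_on_eviction("M"))  # True
print(must_write_back_on_eviction("S"))  # False
```

Only Modified lines are dirty, which is why evicting or snooping a Modified line always involves updating main memory first.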
The state of each cache memory block can change depending on the actions taken by the
CPU [3]. Figure 1 presents
these transitions clearly.
Although Figure 1 is very clear, here is a brief explanation. At the beginning, when the cache is empty and a block of memory is loaded into the cache by the processor, the block gets the exclusive state, because there are no other copies of that block in any cache. Then, if this block is written, it changes to the modified state: the block is only in one cache, but it has been modified, so the copy in main memory no longer matches it.

On the other hand, if a block is in the exclusive state in one cache and another CPU tries to read it, that CPU does not find the block in its own cache, so it fetches it from main memory and loads it into its cache memory. The block is then in two different caches, so its state becomes shared. If a CPU wants to write into a block that is in the modified state in another cache, that block has to be cleared from the cache where it was and written back to main memory, because it was the most current copy of the block in the system. The CPU then writes the block and holds it in its cache memory in the modified state, because it now has the most current version. If a CPU wants to read a block and does not find it in its cache while a more recent copy exists in another cache, the system has to flush the block from the cache where it was and write it back to main memory. From there, the block is read, and the new state is shared because there are now two current copies in the system. Finally, if a CPU writes into a shared block, the other copies are invalidated and the block changes its state to modified.
Figure 1: Transitions from CPU bus
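The CPU-initiated transitions just described can be sketched as a small next-state function. This is a simplified model of the transitions in Figure 1, with illustrative names; bus-observed events are handled separately below.

```python
# Sketch of the CPU-initiated MESI transitions (local reads and writes).

MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

def cpu_transition(state, op, others_have_copy=False):
    """Next state of a cache line after a local read or write."""
    if op == "read":
        if state == INVALID:
            # Read miss: Shared if another cache has the line, else Exclusive.
            return SHARED if others_have_copy else EXCLUSIVE
        return state  # read hits do not change the state
    if op == "write":
        # Any local write leaves the line Modified
        # (other copies are invalidated on the bus).
        return MODIFIED
    raise ValueError(op)

print(cpu_transition(INVALID, "read"))     # E
print(cpu_transition(EXCLUSIVE, "write"))  # M
print(cpu_transition(SHARED, "write"))     # M
```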
It should be taken into account that the state of a cache memory block can also change because of the actions of another CPU, an input/output interrupt or a DMA transfer. These transitions are shown in Figure 2. Hence, the processor always uses valid data in its operations. We do not have to worry if a processor has changed data from main memory and has the most current value of those data in its cache: with the MESI protocol, the processor obtains the most current value every time it is required.
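The bus-observed (snoop-side) transitions mentioned above can be sketched the same way: how a cache reacts when it sees another agent's read or write for a line it holds. Again, a simplified illustrative model of Figure 2, not a full implementation.

```python
# Sketch of the snoop-side MESI transitions: the local next state when
# another CPU's bus read or bus write for this line is observed.

MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

def snoop_transition(state, bus_op):
    """Next local state after observing a remote access to this line."""
    if state == INVALID:
        return INVALID                 # nothing cached, nothing to do
    if bus_op == "bus_read":
        # A remote read demotes M/E to Shared
        # (a Modified line is written back first).
        return SHARED
    if bus_op == "bus_write":
        # A remote write invalidates the local copy.
        return INVALID
    raise ValueError(bus_op)

print(snoop_transition(MODIFIED, "bus_read"))  # S (after write-back)
print(snoop_transition(SHARED, "bus_write"))   # I
```

Together, `cpu_transition` for local accesses and `snoop_transition` for observed bus traffic capture the full state machine the two figures depict.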
11. References
[1] Culler, D.E., Singh, J.P., and Gupta, A. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., 1999.
[2] Hamacher, C., Vranesic, Z., and Zaky, S. Computer Organization. McGraw-Hill, 2003.
[3] Handy, J. The Cache Memory Book. Academic Press, 1998.
[4] McGettrick, A., Thies, M.D., Soldan, D.L., and Srimani, P.K. Computer Engineering Curriculum in the New Millennium. IEEE Transactions on Education, vol. 46, no. 4, November 2003.
[5] Patterson, D.A., and Hennessy, J.L. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann Publishers, Inc., 2004.
[6] Stallings, W. Computer Organization and Architecture. Prentice-Hall, 2006.
[7] Tanenbaum, A.S. Structured Computer Organization. Prentice-Hall, 2006.

CLEI Electronic Journal, Volume 12, Number 1, Paper 5, April 2009.