7/28/2019 Embedded Memories based on SOC (VLSI Seminar)
Embedded Memories
Dept of ECE, RVCE Bangalore. Page 1
Introduction
Embedded systems have three functionality aspects:
Processing: processors, for transformation of data
Storage: memory, for retention of data
Communication: buses, for transfer of data
Memory: Basic Concept
Stores a large number of bits
m x n: m words of n bits each
k = log2(m) address input signals, or m = 2^k words
e.g., 4,096 x 8 memory:
32,768 bits
12 address input signals
8 input/output data signals
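The sizing relations above can be checked directly; a short Python sketch for the 4,096 x 8 example:

```python
# Memory organization arithmetic for the 4,096 x 8 example above.
import math

m, n = 4096, 8          # m words of n bits each
total_bits = m * n      # total storage capacity in bits
k = int(math.log2(m))   # address input signals needed (k = log2(m))

print(total_bits)  # 32768
print(k)           # 12
```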
Memory access
r/w: selects read or write
enable: read or write occurs only when asserted
multiport: multiple simultaneous accesses to different locations
1. Write Ability and Storage Permanence
Traditional ROM/RAM distinctions:
ROM: read only; bits stored without power
RAM: read and write; loses stored bits without power
Traditional distinctions blurred:
Advanced ROMs can be written to, e.g., EEPROM
Advanced RAMs can hold bits without power, e.g., NVRAM
Write ability: the manner and speed with which a memory can be written
Storage permanence: the ability of a memory to hold stored bits after they are written
Write ability
Ranges of write ability:
High end: processor writes to memory simply and quickly, e.g., RAM
Middle range: processor writes to memory, but more slowly, e.g., FLASH, EEPROM
Lower range: special equipment (a programmer) must be used to write to memory, e.g., EPROM, OTP ROM
Low end: bits stored only during fabrication, e.g., mask-programmed ROM
In-system programmable memory:
Can be written to by a processor in the embedded system using the memory
Memories in the high end and middle range of write ability
Storage permanence
Ranges of storage permanence:
High end: essentially never loses bits, e.g., mask-programmed ROM
Middle range: holds bits for days, months, or years after the memory's power source is turned off, e.g., NVRAM
Lower range: holds bits as long as power is supplied to the memory, e.g., SRAM
Low end: begins to lose bits almost immediately after they are written, e.g., DRAM
2. ROM: Read-Only Memory
Nonvolatile memory:
Holds bits after power is no longer supplied
High end and middle range of storage permanence
Can be read from, but not written to, by a processor in an embedded system
Traditionally written to (programmed) before being inserted into the embedded system
Uses:
Store a software program for a general-purpose processor; program instructions can occupy one or more ROM words
Store constant data needed by the system
Implement a combinational circuit
Example: 8 x 4 ROM
Horizontal lines = words
Vertical lines = data
Lines connected only at circles
The decoder sets word 2's line to 1 if the address input is 010
Data lines Q3 and Q1 are set to 1 because there is a programmed connection with word 2's line
Word 2 is not connected with data lines Q2 and Q0
Output is 1010
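The decode walk-through above can be modeled as a simple lookup table; a minimal sketch (only word 2's contents come from the example, the other words are assumed blank):

```python
# Model of the 8 x 4 ROM example: 8 words, 4 bits each.
# Only word 2 is programmed here (connections on Q3 and Q1),
# matching the walk-through above; other words are assumptions.
rom = [0b0000] * 8
rom[2] = 0b1010  # programmed connections at Q3 and Q1 only

def read(address):
    # The decoder selects one word line; the data lines Q3..Q0
    # reflect the programmed connections of that word.
    return format(rom[address], "04b")

print(read(0b010))  # "1010"
```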
Mask-programmed ROM
Connections programmed at fabrication by a set of masks
Lowest write ability: programmed only once
Highest storage permanence: bits never change unless damaged
Typically used for the final design of high-volume systems: spreads out NRE cost for a low unit cost
OTP ROM: One-Time Programmable ROM
Connections programmed after manufacture by the user:
user provides a file of the desired contents of the ROM
file is input to a machine called a ROM programmer
each programmable connection is a fuse
ROM programmer blows fuses where connections should not exist
Very low write ability: typically written only once; requires a ROM programmer device
Very high storage permanence: bits don't change unless reconnected to the programmer and more fuses blown
Commonly used in final products: cheaper, and harder to inadvertently modify
EPROM: Erasable Programmable ROM
Programmable component is a MOS transistor with a floating gate surrounded by an insulator:
(a) Negative charges form a channel between source and drain, storing a logic 1
(b) A large positive voltage at the gate causes negative charges to move out of the channel and become trapped in the floating gate, storing a logic 0
(c) (Erase) Shining UV rays on the surface of the floating gate causes the negative charges to return to the channel from the floating gate, restoring the logic 1
(d) An EPROM package has a quartz window through which the UV light can pass
Better write ability: can be erased and reprogrammed thousands of times
Reduced storage permanence: program lasts about 10 years, but is susceptible to radiation and electrical noise
Typically used during design development
EEPROM: Electrically Erasable Programmable ROM
Programmed and erased electronically, typically using a higher-than-normal voltage; can program and erase individual words
Better write ability:
can be in-system programmable, with a built-in circuit to provide the higher-than-normal voltage
a built-in memory controller is commonly used to hide the details from the memory user
writes are very slow due to erasing and programming; a busy pin indicates to the processor that the EEPROM is still writing
can be erased and programmed tens of thousands of times
Similar storage permanence to EPROM (about 10 years)
Far more convenient than EPROM, but more expensive
Flash Memory
Extension of EEPROM:
same floating-gate principle
same write ability and storage permanence
Fast erase: large blocks of memory erased at once, rather than one word at a time; blocks are typically several thousand bytes large
Writes to single words may be slower: the entire block must be read, the word updated, then the entire block written back
Used in embedded systems storing large data items in nonvolatile memory, e.g., digital cameras, TV set-top boxes, cell phones
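The read-update-erase-rewrite sequence described above can be sketched as follows; the block size and erased value are illustrative assumptions, not taken from these notes:

```python
# Sketch of the word-update sequence described above: to change one
# word, the whole block is read, modified in a buffer, erased, and
# rewritten. BLOCK_WORDS and ERASED are illustrative assumptions.
BLOCK_WORDS = 1024
ERASED = 0xFF  # flash bits erase to all 1s

flash_block = [ERASED] * BLOCK_WORDS

def write_word(block, offset, value):
    buffer = list(block)          # 1. read the entire block
    buffer[offset] = value        # 2. update the single word
    for i in range(len(block)):   # 3. erase the whole block at once
        block[i] = ERASED
    block[:] = buffer             # 4. write the entire block back

write_word(flash_block, 5, 0x3C)
print(flash_block[5])   # 60
```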
3. RAM
Typically volatile memory: bits are not held without a power supply
Read and written easily by the embedded system during execution
Internal structure more complex than ROM:
a word consists of several memory cells, each storing 1 bit
each input and output data line connects to each cell in its column
rd/wr connected to every cell
when a row is enabled by the decoder, each cell's logic stores the input data bit when rd/wr indicates write, or outputs the stored bit when rd/wr indicates read
Basic types of RAM
SRAM: Static RAM
memory cell uses a flip-flop to store the bit
requires 6 transistors
holds data as long as power is supplied
DRAM: Dynamic RAM
memory cell uses a MOS transistor and a capacitor to store the bit
more compact than SRAM
refresh required because the capacitor leaks; a word's cells are refreshed when read
typical refresh rate: 15.625 microsec.
slower to access than SRAM
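The 15.625 microsecond figure quoted above is consistent with a commonly assumed DRAM retention spec; a quick check (the 64 ms retention window and 4096-row array are assumptions, not stated in these notes):

```python
# If every row must be refreshed within 64 ms (assumed retention
# spec) and a 4096-row array spreads those refreshes evenly, one
# row is refreshed every 15.625 microseconds.
retention_ms = 64
rows = 4096
interval_us = retention_ms * 1000 / rows
print(interval_us)  # 15.625
```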
RAM variations
PSRAM: Pseudo-static RAM
DRAM with a built-in memory refresh controller
popular low-cost, high-density alternative to SRAM
NVRAM: Nonvolatile RAM
holds data after external power is removed
Battery-backed RAM:
SRAM with its own permanently connected battery
writes are as fast as reads
no limit on the number of writes, unlike nonvolatile ROM-based memory
SRAM with EEPROM or flash:
stores the complete RAM contents in EEPROM or flash before power is turned off
4. Scratchpad Memory
Embedded processor-based system:
> Processor core
> Embedded memory
> Instruction and data cache
> Embedded SRAM
> Embedded DRAM
Scratch-pad memory design problems:
1. How much on-chip memory?
2. How to partition on-chip memory between cache and scratchpad?
3. Which variables/arrays go in the scratchpad?
Goals:
> Improve performance
> Save power
Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications
Abstract
Efficient utilization of on-chip memory space is extremely important in modern embedded system applications based on microprocessor cores. In addition to a data cache that interfaces with slower off-chip memory, a fast on-chip SRAM, called Scratch-Pad memory, is often used in several applications. This paper presents a technique for efficiently exploiting on-chip Scratch-Pad memory by partitioning the application's scalar and array variables into off-chip DRAM and on-chip Scratch-Pad SRAM, with the goal of minimizing the total execution time of embedded applications.
> Introduction
Complex embedded system applications typically use heterogeneous chips consisting of microprocessor cores, along with on-chip memory and co-processors. Flexibility and short design time considerations drive the use of CPU cores as instantiable modules in system designs [5]. The integration of processor cores and memory in the same chip effects a reduction in the chip count, leading to cost-effective solutions. Examples of commercial microprocessor cores commonly used in system design are LSI Logic's CW33000 series [3] and the ARM series from Advanced RISC Machines [10].
Typical examples of optional modules integrated with the processor on the same chip are: instruction cache, data cache, and on-chip SRAM. The instruction and data caches are fast local memory serving as an interface between the processor and the off-chip memory. The on-chip SRAM, termed Scratch-Pad memory, is a small, high-speed data memory that is mapped into an address space disjoint from the off-chip memory, but connected to the same address and data buses.
Both the cache and the Scratch-Pad SRAM have a single-processor-cycle access latency, whereas an access to the off-chip memory (usually DRAM) takes several (typically 10-20) processor cycles.
The main difference between the Scratch-Pad SRAM and the data cache is that the SRAM guarantees a single-cycle access time, whereas an access to the cache is subject to compulsory, capacity, and conflict misses.
When an embedded application is compiled, the accessed data can now be stored either in the Scratch-Pad memory or in off-chip memory. In the second case, it is accessed by the processor through the data cache. We present a technique for minimizing the total execution time of an embedded application by a careful partitioning of the scalar and array variables used in the application into off-chip DRAM (accessed through the data cache) and Scratch-Pad SRAM.
Optimization techniques for improving the data cache performance of programs have been reported [4, 7, 9]. The analysis in [9] is limited to scalars, and hence, not generally applicable. Iteration space blocking for improving data locality is studied in [4]. This technique is also
limited to the type of code that yields naturally to blocking. In [7], a data layout strategy for avoiding conflict misses is presented. However, array access patterns in some applications are too complex to be statically analyzable using this method. The availability of an on-chip SRAM with guaranteed fast access time creates an opportunity for overcoming some of the cache conflict problems (Section 2). The problem of partitioning data into SRAM and cache with the objective of maximizing performance, which we address in this paper, has, to our knowledge, not been attempted before.
> Problem Description
Figure 1(a) shows the architectural block diagram of an application employing a typical embedded core processor (e.g., the LSI Logic CW33000 RISC microprocessor core [3]), where the parts enclosed in the dotted rectangle are implemented in one chip, and which interfaces with an off-chip memory, usually realized with DRAM. The address and data buses from the CPU core connect to the Data Cache, Scratch-Pad memory, and External Memory Interface (EMI) blocks. On a memory access request from the CPU, the data cache indicates a cache hit to the EMI block through the C_HIT signal. Similarly, if the SRAM interface circuitry in the Scratch-Pad memory determines that the referenced memory address maps into the on-chip SRAM, it assumes control of the data bus and indicates this status to the EMI through the S_HIT signal. If both the cache and SRAM report misses, the EMI transfers a block of data of the appropriate size (equal to the cache line size) between the cache and the DRAM.
The data address space mapping is shown in Figure 1(b). The lowest range of memory addresses maps into the Scratch-Pad memory and has a single-processor-cycle access time. Thus, in Figure 1(a), S_HIT would be asserted whenever the processor attempts to access any address in that range. The remaining addresses map into the off-chip DRAM, and are accessed by the CPU through the data cache. A cache hit for an address in this range results in a single-cycle delay, whereas a cache miss, which leads to a block transfer between off-chip memory and the cache, results in a delay of 10-20 processor cycles.
Suppose the above code is executed on a processor configured with a data cache of size 1 KByte. The performance is degraded by the conflict misses in the cache between elements of the two arrays Hist and BrightnessLevel. Data layout techniques, such as [7], are not effective in eliminating the above type of conflicts, because the accesses to Hist are data-dependent. Note that this problem occurs in both direct-mapped and set-associative caches.
However, the conflict problem can be solved elegantly if we include a Scratch-Pad SRAM in the architecture. Since the Hist array is relatively small, we can store it in the SRAM, so that it does not conflict with BrightnessLevel in the data cache. This storage assignment improves the performance of the Histogram Evaluation code significantly.
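The Histogram Evaluation code itself is not reproduced in these notes; the sketch below assumes a typical histogram kernel, and shows why accesses to Hist are data-dependent and therefore not statically analyzable for cache layout:

```python
# Assumed histogram kernel: Hist is indexed by values read from
# BrightnessLevel, so its access pattern depends on the image data
# and cannot be predicted at compile time. The pixel data here is
# purely illustrative.
BrightnessLevel = [12, 255, 12, 0, 128, 12]
Hist = [0] * 256

for level in BrightnessLevel:
    Hist[level] += 1      # data-dependent index into Hist

print(Hist[12])  # 3
```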
We present a strategy for partitioning the scalar and array variables in an application code into Scratch-Pad memory and off-chip DRAM accessed through the data cache, to maximize performance by selectively mapping to the SRAM those variables that are estimated to cause the maximum number of conflicts in the data cache.
> The Partitioning Strategy
The overall approach in partitioning program variables into Scratch-Pad memory and DRAM is to minimize the cross-interference between different variables in the data cache. We first outline the different features of the code affecting the partitioning.
5. Cache
Want inexpensive, fast memory
Main memory: large, inexpensive, slow memory; stores the entire program and data
Cache: small, expensive, fast memory; stores a copy of likely-accessed parts of the larger memory
Can be multiple levels of cache
> Introduction to Memory Hierarchy
Usually designed with SRAM: faster but more expensive than DRAM
Usually on the same chip as the processor:
space limited, so much smaller than off-chip main memory
faster access (1 cycle vs. several cycles for main memory)
Cache operation:
Request for main memory access (read or write)
First, check the cache for a copy
cache hit: copy is in the cache; quick access
cache miss: copy not in the cache; read the address and possibly its neighbors into the cache
Several cache design choices: cache mapping, replacement policies, and write techniques
> Different Mapping Techniques
Direct mapping
Main memory address divided into two fields:
Index: the cache address; number of bits determined by cache size
Tag: compared with the tag stored in the cache at the address indicated by the index; if the tags match, check the valid bit
Valid bit: indicates whether the data in the slot has been loaded from memory
Offset: used to find the particular word in the cache line
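The tag/index/offset split described above can be sketched for an assumed configuration (a 1 KByte direct-mapped cache with 16-byte lines and 32-bit byte addresses; these sizes are illustrative, not from the notes):

```python
# Splitting a main-memory address into tag / index / offset fields
# for a direct-mapped cache. Sizes are illustrative assumptions:
# 1 KByte cache, 16-byte lines.
CACHE_BYTES, LINE_BYTES = 1024, 16
NUM_LINES = CACHE_BYTES // LINE_BYTES        # 64 lines
OFFSET_BITS = LINE_BYTES.bit_length() - 1    # 4 bits of offset
INDEX_BITS = NUM_LINES.bit_length() - 1      # 6 bits of index

def split(addr):
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split(0x12345))  # (72, 52, 5)
```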
Fully associative mapping
Complete main memory address stored in each cache address
All addresses stored in the cache are simultaneously compared with the desired address
Valid bit and offset same as in direct mapping
Set-associative mapping
Compromise between direct mapping and fully associative mapping
Index same as in direct mapping
But each cache address contains the content and tags of 2 or more memory address locations
Tags of that set simultaneously compared, as in fully associative mapping
Cache with set size N called N-way set-associative; 2-way, 4-way, and 8-way are common
Technique for choosing which block to replace:
when a fully associative cache is full
when a set-associative cache's set is full
direct-mapped cache has no choice
Random: replace a block chosen at random
LRU: least recently used
replace the block not accessed for the longest time
FIFO: first-in first-out
push block onto a queue when accessed
choose the block to replace by popping the queue
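The LRU policy above can be sketched for a single set of an associative cache; the set size and the reference string are illustrative assumptions:

```python
# Minimal LRU replacement model for one set of an associative cache.
from collections import OrderedDict

def lru_misses(references, ways):
    cache = OrderedDict()
    misses = 0
    for block in references:
        if block in cache:
            cache.move_to_end(block)      # mark as most recently used
        else:
            misses += 1
            if len(cache) == ways:
                cache.popitem(last=False) # evict least recently used
            cache[block] = True
    return misses

print(lru_misses([1, 2, 3, 1, 4, 1, 2], ways=2))  # 6
```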
> Cache Write Techniques
When written, the data cache must update main memory
Write-through:
write to main memory whenever the cache is written to
easiest to implement
processor must wait for the slower main memory write
potential for unnecessary writes
Write-back:
main memory written only when a dirty block is replaced
extra dirty bit for each block, set when the cache block is written to
reduces the number of slow main memory writes
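The write-back behavior above can be sketched with a single cache line; the class and its names are illustrative, not from the notes:

```python
# Sketch of write-back: main memory is written only when a dirty
# block is evicted, not on every cache write.
class WriteBackLine:
    def __init__(self):
        self.tag, self.data, self.dirty = None, None, False
        self.memory_writes = 0

    def write(self, tag, data):
        if self.tag != tag and self.dirty:
            self.memory_writes += 1   # flush the dirty block on eviction
        self.tag, self.data, self.dirty = tag, data, True

line = WriteBackLine()
line.write(0xA, 1)   # first write: only the cache is updated
line.write(0xA, 2)   # write hit: still only the cache is updated
line.write(0xB, 3)   # different block: eviction flushes the dirty line
print(line.memory_writes)  # 1
```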
> Cache Impact on System Performance
Most important parameters in terms of performance:
total size of cache: total number of data bytes the cache can hold (tag, valid, and other housekeeping bits not included in the total)
degree of associativity
data block size
Larger caches achieve lower miss rates but higher access cost
e.g.:
2 KByte cache: miss rate = 15%, hit cost = 2 cycles, miss cost = 20 cycles
avg. cost of memory access = (0.85 * 2) + (0.15 * 20) = 4.7 cycles
4 KByte cache: miss rate = 6.5%, hit cost = 3 cycles, miss cost unchanged
avg. cost of memory access = (0.935 * 3) + (0.065 * 20) = 4.105 cycles (improvement)
8 KByte cache: miss rate = 5.565%, hit cost = 4 cycles, miss cost unchanged
avg. cost of memory access = (0.94435 * 4) + (0.05565 * 20) = 4.8904 cycles (worse)
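The three averages above can be recomputed directly from the stated miss rates and hit costs:

```python
# avg = hit_rate * hit_cost + miss_rate * miss_cost,
# with the 20-cycle miss cost held fixed across all three caches.
def avg_cost(miss_rate, hit_cost, miss_cost=20):
    return (1 - miss_rate) * hit_cost + miss_rate * miss_cost

print(round(avg_cost(0.15, 2), 4))      # 4.7     (2 KByte)
print(round(avg_cost(0.065, 3), 4))     # 4.105   (4 KByte)
print(round(avg_cost(0.05565, 4), 4))   # 4.8904  (8 KByte)
```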
6. Advanced RAM
DRAMs commonly used as main memory in processor-based embedded systems: high capacity, low cost
Many variations of DRAM proposed, needed to keep pace with processor speeds:
FPM DRAM: fast page mode DRAM
EDO DRAM: extended data out DRAM
SDRAM/ESDRAM: synchronous and enhanced synchronous DRAM
RDRAM: Rambus DRAM
6.1 Basic DRAM
Address bus multiplexed between row and column components
Row and column addresses are latched in, sequentially, by strobing the ras (row address strobe) and cas (column address strobe) signals, respectively
Refresh circuitry can be external or internal to the DRAM device:
strobes consecutive memory addresses periodically, causing the memory content to be refreshed
refresh circuitry disabled during read or write operations
Fast Page Mode DRAM (FPM DRAM)
Each row of memory bit array is viewed as a page
A page contains multiple words
Individual words addressed by column address
Timing diagram:
row (page) address sent
3 words read consecutively by sending a column address for each
extra cycle eliminated on each read/write of words from the same page
Extended Data Out DRAM (EDO DRAM)
Improvement of FPM DRAM
Extra latch before the output buffer allows strobing of cas before the data read operation completes; reduces read/write latency by one additional cycle
Synchronous and Enhanced Synchronous (ES) DRAM
SDRAM latches data on the active edge of the clock
Eliminates time to detect the ras/cas and rd/wr signals
A counter is initialized to the column address, then incremented on the active edge of the clock to access consecutive memory locations
ESDRAM improves on SDRAM:
added buffers enable overlapping of column addressing
faster clocking and lower read/write latency possible
Rambus DRAM (RDRAM)
More of a bus interface architecture than a DRAM architecture
Data latched on both rising and falling edges of the clock
Broken into 4 banks, each with its own row decoder: can have 4 pages open at a time
Capable of very high throughput
6.2 DRAM Integration Problem
SRAM easily integrated on the same chip as the processor; DRAM more difficult:
different chip-making processes for DRAM and conventional logic
goal of conventional logic (IC) designers: minimize parasitic capacitance to reduce signal propagation delays and power consumption
goal of DRAM designers: create capacitor cells to retain the stored information
Integration processes beginning to appear
6.3 Memory Management Unit (MMU)
Duties of the MMU:
handles DRAM refresh, bus interface, and arbitration
takes care of memory sharing among multiple processors
translates logical memory addresses from the processor into physical memory addresses of the DRAM
Modern CPUs often come with the MMU built in; single-purpose processors can also be used
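The address-translation duty above can be sketched as a simple page-table lookup; the page size and table contents are illustrative assumptions:

```python
# Sketch of logical-to-physical address translation: a page table
# maps logical page numbers to physical frames. PAGE_SIZE and the
# table contents are illustrative assumptions.
PAGE_SIZE = 4096
page_table = {0: 5, 1: 9}   # logical page -> physical frame

def translate(logical_addr):
    page, offset = divmod(logical_addr, PAGE_SIZE)
    return page_table[page] * PAGE_SIZE + offset

print(hex(translate(0x1004)))  # 0x9004 (page 1 maps to frame 9)
```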
7. Cache Coherence Protocols
The presence of caches in current-generation distributed shared-memory multiprocessors improves performance by reducing the processor's memory access time and by decreasing the bandwidth requirements of both the local memory module and the global interconnect. Unfortunately, the local caching of data introduces the cache coherence problem. Early distributed shared-memory machines left it to the programmer to deal with the cache coherence problem, and consequently these machines were considered difficult to program [5][38][54]. Today's multiprocessors solve the cache coherence problem in hardware by implementing a cache coherence protocol. This chapter outlines the cache coherence problem and describes how cache coherence protocols solve it.
In addition, this chapter discusses several different varieties of cache coherence protocols, including their advantages and disadvantages, their organization, their common protocol transitions, and some examples of machines that implement each protocol. Ultimately a designer has to choose a protocol to implement, and this should be done carefully. Protocol choice can lead to differences in cache miss latencies and differences in the number of messages sent through the interconnection network, both of which can lead to differences in overall application performance. Moreover, some protocols have high-level properties like automatic data distribution or distributed queueing that can help application performance. Before discussing specific protocols, however, let us examine the cache coherence problem in distributed shared-memory machines in detail.
7.1 The Cache Coherence Problem
Figure 2.1 depicts an example of the cache coherence problem. Memory initially contains the value 0 for location x, and processors 0 and 1 both read location x into their caches. If processor 0 writes location x in its cache with the value 1, then processor 1's cache now contains the stale value 0 for location x. Subsequent reads of location x by processor 1 will continue to return the stale, cached value of 0. This is likely not what the programmer expected when she wrote the program. The expected behavior is for a read by any processor to return the most up-to-date copy of the datum. This is exactly what a cache coherence protocol does: it ensures that requests for a certain datum always return the most recent value.
The coherence protocol achieves this goal by taking action whenever a location is written. More precisely, since the granularity of a cache coherence protocol is a cache line, the protocol takes action whenever any cache line is written. Protocols can take two kinds of actions when a cache line L is written: they may either invalidate all copies of L from the other caches in the machine, or they may update those lines with the new value being written.
Continuing the earlier example, in an invalidation-based protocol, when processor 0 writes x = 1, the line containing x is invalidated from processor 1's cache. The next time processor 1 reads location x it suffers a cache miss, and goes to memory to retrieve the latest copy of the cache line. In systems with write-through caches, memory can supply the data because it was updated when processor 0 wrote x. In the more common case of systems with writeback caches, the cache coherence protocol has to ensure that processor 1 asks processor 0 for the latest copy of the cache line. Processor 0 then supplies the line from its cache and processor 1 places that line into its cache, completing its cache miss. In update-based protocols, when processor 0 writes x = 1, it sends the new copy of the datum directly to processor 1 and updates the line in processor 1's cache with the new value. In either case, subsequent reads by processor 1 now see the correct value of 1 for location x, and the system is said to be cache coherent.
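The invalidation walk-through above can be sketched as a toy model with two private caches; write-through is assumed for simplicity, matching the first case the text describes:

```python
# Toy invalidation-based coherence model: two private caches over a
# shared memory location x. Write-through is assumed so that memory
# always holds the latest value.
memory = {"x": 0}
caches = [dict(), dict()]          # one private cache per processor

def read(p, addr):
    if addr not in caches[p]:                 # miss: fetch latest copy
        caches[p][addr] = memory[addr]
    return caches[p][addr]

def write(p, addr, value):
    for q, cache in enumerate(caches):        # invalidate other copies
        if q != p:
            cache.pop(addr, None)
    caches[p][addr] = value
    memory[addr] = value                      # write-through to memory

read(0, "x"); read(1, "x")   # both caches now hold x = 0
write(0, "x", 1)             # invalidates x in processor 1's cache
print(read(1, "x"))          # 1, not the stale 0
```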
Most modern cache-coherent multiprocessors use the invalidation technique rather than
the update technique since it is easier to implement in hardware. As cache line sizes continue
to increase, the invalidation-based protocols remain popular because of the increased
number of updates required when writing a cache line sequentially with an update-based coherence protocol. There are times, however, when using an update-based protocol is superior. These include accessing heavily contended lines and some types of synchronization variables. Typically designers choose an invalidation-based protocol and add some special features to handle heavily contended synchronization variables. All the protocols presented in this paper are invalidation-based cache coherence protocols, and a later section is devoted to the discussion of synchronization primitives.
8. Directory-Based Coherence
The previous section describes the cache coherence problem and introduces the cache coherence protocol as the agent that solves the coherence problem. But the question remains: how do cache coherence protocols work?
There are two main classes of cache coherence protocols: snoopy protocols and directory-based protocols. Snoopy protocols require the use of a broadcast medium in the machine and hence apply only to small-scale bus-based multiprocessors. In these broadcast systems, each cache snoops on the bus and watches for transactions which affect it. Any time a cache sees a write on the bus, it invalidates that line out of its cache if it is present. Any time a cache sees a read request on the bus, it checks to see if it has the most recent copy of the data, and if so, responds to the bus request. These snoopy bus-based systems are easy to build, but unfortunately as the number of processors on the bus increases, the single shared bus becomes a bandwidth bottleneck and the snoopy protocols' reliance on a broadcast mechanism becomes a severe scalability limitation.
To address these problems, architects have adopted the distributed shared memory (DSM) architecture. In a DSM multiprocessor, each node contains the processor and its caches, a portion of the machine's physically distributed main memory, and a node controller which manages communication within and between nodes (see Figure 2.2). Rather than being connected by a single shared bus, the nodes are connected by a scalable interconnection network. The DSM architecture allows multiprocessors to scale to thousands
of nodes, but the lack of a broadcast medium creates a problem for the cache coherence protocol. Snoopy protocols are no longer appropriate, so instead designers must use a directory-based cache coherence protocol.
The first description of directory-based protocols appears in Censier and Feautrier's 1978 paper [9]. The directory is simply an auxiliary data structure that tracks the caching state of each cache line in the system. For each cache line in the system, the directory needs to track which caches, if any, have read-only copies of the line, or which cache has the latest copy of the line if the line is held exclusively. A directory-based cache-coherent machine works by consulting the directory on each cache miss and taking the appropriate action based on the type of request and the current state of the directory.
Figure 2.3 shows a directory-based DSM machine. Just as main memory is physically distributed throughout the machine to improve aggregate memory bandwidth, so the directory is distributed to eliminate the bottleneck that would be caused by a single monolithic directory. If each node's main memory is divided into cache-line-sized blocks, then the directory can be thought of as extra bits of state for each block of main memory. Any time
a processor wants to read cache line L, it must send a request to the node that has the directory for line L. This node is called the home node for L. The home node receives the request, consults the directory, and takes the appropriate action. On a cache read miss, for example, if the directory shows that the line is currently uncached or is cached read-only
9. MESI Cache Coherence
Abstract
Nowadays, computational systems (multiprocessor and uniprocessor) need to avoid the cache coherence problem. There are several techniques to solve this problem; the MESI cache coherence protocol is one of them. This paper presents a simulator of the MESI protocol, which is used for teaching cache memory coherence on computer systems with a hierarchical memory system and for explaining the process of cache memory location in multilevel cache memory systems. The paper gives a description of the course in which the simulator is used, a short explanation of the MESI protocol, and how the simulator works. Then, some experimental results in a real teaching environment are described.
Keywords: Cache memory, Coherence protocol, MESI, Simulator, Teaching tool.
9.1 Introduction
In multiprocessor systems, the memory should provide a set of locations that hold values, and when a location is read it should return the latest value written to that location. This property must be established to communicate data between threads or processes: a read returns the latest value written to the location regardless of which process wrote it. This issue is known as the cache coherence problem. Problems of this kind arise even in uniprocessors when I/O operations occur. Most I/O transfers are performed by direct memory access (DMA) devices that move data between the memory and the peripheral component without involving the processor [5]. When the DMA device writes to a location in main memory, unless special action is taken, the processor may continue to see the old value if that location was previously present in its cache [1]. The techniques and support used to solve the multiprocessor cache coherence problem also solve the I/O coherence problem. Essentially all microprocessors today provide support for multiprocessor
cache coherence. The MESI cache coherence protocol is a technique to maintain the coherence of the cache memory contents in hierarchical memory systems [2], [7]. It is based on four possible states of the cache blocks: Modified, Exclusive, Shared and Invalid. Each accessed block lies in one of these states, and the transitions among them define the MESI protocol. Nowadays, most processors (Intel, AMD) use this protocol or variants of it, so knowing how these processors maintain cache coherence is very important for students. This paper presents a simulator of the MESI cache coherence protocol [1], [6]. The MESI simulator is a software tool implemented in the Java language.
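The stale-value scenario behind the I/O coherence problem can be shown in a toy model: a cache that is never invalidated when a DMA device writes to main memory behind its back. All names here are illustrative; this is not how real hardware is structured.

```python
# Toy model of the I/O coherence problem: the CPU keeps serving a cached
# value after a DMA device has updated main memory directly.

main_memory = {0x100: "old"}
cpu_cache = {}

def cpu_read(addr):
    # Read through the cache, filling it on a miss.
    if addr not in cpu_cache:
        cpu_cache[addr] = main_memory[addr]
    return cpu_cache[addr]

def dma_write(addr, value):
    # The DMA device writes memory without notifying the cache.
    main_memory[addr] = value

print(cpu_read(0x100))   # "old" -- the line is now cached
dma_write(0x100, "new")
print(cpu_read(0x100))   # still "old": the cache holds a stale copy
```

A coherence mechanism would invalidate or update the cached line on the DMA write, which is exactly what the MESI machinery described below provides.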
10. MESI Protocol
The MESI protocol makes it possible to maintain coherence in cached systems. It is based on the four states that a block in the cache memory can have; these four states give the protocol its name: Modified, Exclusive, Shared and Invalid. The states are explained below:
Invalid: a non-valid state. The data being looked for are not in the cache, or the local copy of those data is not correct because another processor has updated the corresponding memory position.

Shared: shared without having been modified. Another processor may also have the data in its cache memory, and both copies are in their current version.

Exclusive: exclusive without having been modified. That is, this cache is the only one that has the block, and its contents agree with those in main memory.

Modified: in fact an exclusive-modified state. It means that the cache has the only copy that is correct in the whole system; the data in main memory are stale.
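The four states above boil down to three boolean properties of a cache line. The table below is a sketch that restates the descriptions; the property names are chosen for the example.

```python
# The four MESI states summarized as (valid, dirty, may_be_shared),
# following directly from the descriptions above.

MESI = {
    "M": (True,  True,  False),   # only correct copy; memory is stale
    "E": (True,  False, False),   # only cached copy; matches memory
    "S": (True,  False, True),    # other caches may hold the same copy
    "I": (False, False, False),   # no usable copy in this cache
}

def must_write_back_on_eviction(state):
    """A line needs a write-back only if it is valid and dirty."""
    valid, dirty, _ = MESI[state]
    return valid and dirty

print(must_write_back_on_eviction("M"))  # True
print(must_write_back_on_eviction("S"))  # False
```

Only Modified lines are dirty, which is why evicting or snooping a Modified line always involves updating main memory first.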
The state of each cache memory block can change depending on the actions taken by the
CPU [3]. Figure 1 presents
these transitions clearly.
Although Figure 1 is very clear, here is a brief explanation. At the beginning, when the cache is empty and a block of memory is loaded into the cache by the processor, the block gets the exclusive state, because there are no other copies of that block in any cache. Then, if this block is written, it changes to the modified state: the block is only in one cache, but it has been modified, so the copy in main memory no longer matches it.

On the other hand, if a block is in the exclusive state in one cache and another CPU tries to read it, that CPU does not find the block in its own cache, so it fetches it from main memory and loads it into its cache memory. The block is then in two different caches, so its state becomes shared. If a CPU wants to write into a block that is in the modified state in another cache, that block has to be cleared from the cache where it was and written back to main memory, because it was the most current copy of the block in the system. The CPU then writes the block and holds it in its cache memory in the modified state, because it now has the most current version. If a CPU wants to read a block and does not find it in its cache while a more recent copy exists in another cache, the system has to flush the block from the cache where it was and write it back to main memory. From there, the block is read, and the new state is shared because there are now two current copies in the system. Finally, if a CPU writes into a shared block, the other copies are invalidated and the block changes its state to modified.
Figure 1: Transitions from CPU bus
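The CPU-initiated transitions just described can be sketched as a small next-state function. This is a simplified model of the transitions in Figure 1, with illustrative names; bus-observed events are handled separately below.

```python
# Sketch of the CPU-initiated MESI transitions (local reads and writes).

MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

def cpu_transition(state, op, others_have_copy=False):
    """Next state of a cache line after a local read or write."""
    if op == "read":
        if state == INVALID:
            # Read miss: Shared if another cache has the line, else Exclusive.
            return SHARED if others_have_copy else EXCLUSIVE
        return state  # read hits do not change the state
    if op == "write":
        # Any local write leaves the line Modified
        # (other copies are invalidated on the bus).
        return MODIFIED
    raise ValueError(op)

print(cpu_transition(INVALID, "read"))     # E
print(cpu_transition(EXCLUSIVE, "write"))  # M
print(cpu_transition(SHARED, "write"))     # M
```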
It should be taken into account that the state of a cache memory block can also change because of the actions of another CPU, an input/output interrupt or a DMA transfer. These transitions are shown in Figure 2. Hence, the processor always uses valid data in its operations. We do not have to worry if a processor has changed data from main memory and has the most current value of those data in its cache: with the MESI protocol, the processor obtains the most current value every time it is required.
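The bus-observed (snoop-side) transitions mentioned above can be sketched the same way: how a cache reacts when it sees another agent's read or write for a line it holds. Again, a simplified illustrative model of Figure 2, not a full implementation.

```python
# Sketch of the snoop-side MESI transitions: the local next state when
# another CPU's bus read or bus write for this line is observed.

MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

def snoop_transition(state, bus_op):
    """Next local state after observing a remote access to this line."""
    if state == INVALID:
        return INVALID                 # nothing cached, nothing to do
    if bus_op == "bus_read":
        # A remote read demotes M/E to Shared
        # (a Modified line is written back first).
        return SHARED
    if bus_op == "bus_write":
        # A remote write invalidates the local copy.
        return INVALID
    raise ValueError(bus_op)

print(snoop_transition(MODIFIED, "bus_read"))  # S (after write-back)
print(snoop_transition(SHARED, "bus_write"))   # I
```

Together, `cpu_transition` for local accesses and `snoop_transition` for observed bus traffic capture the full state machine the two figures depict.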
11. References
[1] Culler, D.E., Singh, J.P., and Gupta, A. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., 1999.
[2] Hamacher, C., Vranesic, Z., and Zaky, S. Computer Organization. McGraw-Hill, 2003.
[3] Handy, J. The Cache Memory Book. Academic Press, 1998.
[4] McGettrick, A., Thies, M.D., Soldan, D.L., and Srimani, P.K. Computer Engineering Curriculum in the New Millennium. IEEE Transactions on Education, vol. 46, no. 4, November 2003.
[5] Patterson, D.A., and Hennessy, J.L. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann Publishers, Inc., 2004.
[6] Stallings, W. Computer Organization and Architecture. Prentice-Hall, 2006.
[7] Tanenbaum, A.S. Structured Computer Organization. Prentice-Hall, 2006.

CLEI Electronic Journal, Volume 12, Number 1, Paper 5, April 2009.