CSC 370 (Blum)

Cache

Based in part on Chapter 9 in Computer Architecture (Nicholas Carter)


Pentium 4 Blurb (L1)

Some cache terms to understand:

• Data cache

• Execution Trace Cache


Pentium 4 Blurb (L2)

Some cache terms to understand:

• Non-Blocking

• 8-way set associativity

• on-die


Caching Analogy: Why Grading Late Homework Is a Pain

• To grade a student’s homework problem, a professor must

1. Solve the problem

2. Compare the answer with the student’s

• When grading the homework of an entire class, the professor can

1. Solve the problem

2. Compare the answer with Student 1’s answer

3. Compare the answer with Student 2’s answer

4. …


Caching Analogy (Cont.)

• In other words, the professor “caches” the solution so that all students after the first can be graded much more quickly than the first.

• Even if the professor “stores” the solution (that is, files it away), it is not handy when it comes time to grade the late student’s homework.


Caching Analogy (Cont.)

• You might think the previous example makes the benefits of caching seem contrived, since the professor instructed all of the students to solve the same problem and submit it at the same time.

• Suppose students (of their own volition) looked at the problems at the end of the chapter being discussed. – It’s hard to imagine, I know.


Caching Analogy (Cont.)

• Then a student might come to the professor’s office for help on a difficult problem.

• The professor should keep the solution handy because a problem that was difficult for one student is likely to be difficult for other students who are likely to turn up soon.

• This is the notion of “locality of reference.”

– What was needed/used recently is likely to be needed/used again soon.


Locality Of Reference

• The memory assigned to an executing program will have both data and instructions. At a given time, the probability that the processor will need to access a given memory location is not equally distributed among all of the memory locations.

– The program may be more likely to need a location that it has accessed in the recent past – this is known as temporal locality.

– The program may be more likely to need a location that is near the one just accessed – this is known as spatial locality.
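The two kinds of locality can be sketched numerically. A minimal illustration (the line length and element size are assumptions, not from the slides): a sequential scan of an array touches far fewer cache lines than it makes accesses, which is spatial locality at work.

```python
LINE_LENGTH = 32   # assumed cache line length in bytes
ELEM_SIZE = 4      # assumed 4-byte array elements

def lines_touched(base_addr, n_elems):
    """Return the set of distinct cache lines a sequential array scan touches."""
    return {(base_addr + i * ELEM_SIZE) // LINE_LENGTH for i in range(n_elems)}

# 1024 accesses land on only 128 distinct lines (consecutive addresses
# share lines): spatial locality.
lines = lines_touched(0x1000, 1024)
print(len(lines))  # 128
```

Temporal locality is the complementary effect: the loop counter and accumulator of such a scan are reused on every iteration.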


Loops and Arrays

• Consider that the tasks best suited for automation (being done by a machine, including a computer) are repetitive.

• Any program with loops and arrays is a good candidate to display locality of reference.

• Waiting for some user event is also very repetitive. This repetition may be hidden from the programmer working in a high-level language.


Locality of reference

• Locality of reference is the principle behind caching.

• Locality of reference is what allows 256-512 KB of cache to stand in for 256-512 MB of memory (that’s a factor of 1000) ninety-some percent of the time.

Caching

• The term cache can be used in different ways.

• Sometimes “cache” is used to refer generally to placing something where it can be retrieved quickly. In this sense of the term, there is an entire hierarchy of caching: SRAM is faster than DRAM, which is faster than the hard drive, which is faster than the Internet.

• Sometimes “cache” is used to refer specifically to the top layer of the above hierarchy (the SRAM).

– For the rest of the presentation, we will be using the latter meaning.


What are we caching?

• We have to look one level down in the memory/storage hierarchy to realize what it is we are caching.

• One level down is main memory.

– Recall how one interacts with memory (DRAM) – one supplies an address to obtain the value located at that address.


What are we caching? (Cont.)

• We must cache the address and the value.

– Recall our analogy – if the professor writes down the (final) answer but does not recall which problem it is the answer to, it is useless.

• Ultimately we want the value, but it is the (memory) address we will be given, and that is what we will search for in our cache.

– The student does not ask if 43 is the answer (the answer to what?); the student asks what the answer to problem 5-15 is.

Some terminology

• Think of cache as parallel arrays (addresses and values).

• The array of addresses is called the tag array.

• The array of values is called the data array.

– Don’t confuse the terms “data array” and “data cache.”

• A memory address is supplied:

– If it is found in the tag array, one is said to have a cache hit, and the value is sent out.

– If it is not found, one has a cache miss, and the processor must go to memory to obtain the desired value.

– The percentage of cache hits is known as the hit rate (usually one looks for 90% or better).
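The terminology above can be sketched in a few lines. This is a minimal model, not any particular hardware: a dictionary stands in for the parallel tag/data arrays, and a counter tracks the hit rate.

```python
class TinyCache:
    """Toy cache: a tag lookup that falls back to 'memory' on a miss."""

    def __init__(self):
        self.tags = {}      # tag array entry -> data array entry
        self.hits = 0
        self.accesses = 0

    def read(self, addr, memory):
        self.accesses += 1
        if addr in self.tags:      # cache hit: value sent out
            self.hits += 1
            return self.tags[addr]
        value = memory[addr]       # cache miss: go to memory
        self.tags[addr] = value
        return value

    def hit_rate(self):
        return self.hits / self.accesses

memory = {a: a * 2 for a in range(16)}
cache = TinyCache()
for _ in range(10):                # repetitive access -> high hit rate
    cache.read(5, memory)
print(cache.hit_rate())            # 0.9: one cold miss, then nine hits
```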

Cache Controller

• In addition to the tag and data arrays is the cache controller, which runs the show.

– When L2 cache was separate from the processor, the cache controller was part of the system chipset.

– When L2 cache moved onto the microprocessor, so too did the controller.

– Now it is the L3 cache controller that is part of the system chipset.

• Now even L3 is moving onto the microprocessor.


One caches addresses (tags) and values

Cache Address | Memory Address (tag) | Memory Value (data)
0000          | FFA0                 | Some value
0001          | FF18                 | Another
0002          | FFB0                 | Yet Another
…             | …                    | …

Data Array versus Data Cache

• The term data array refers to the set of values that are placed in cache.

• The term data cache refers to the caching of data, as opposed to the instruction cache, where instructions are cached.

• In a modern adaptation of the Harvard architecture, called the Harvard cache, data and instructions are sent to separate caches.

– Unlike data, an instruction is unlikely to be updated – overwritten yes, updated no. Therefore the data cache and instruction cache can have different write policies.


Capacity

• The usual specification (spec) one is given for cache is called the capacity.

– E.g., Norwood-core Pentium 4s have a 512 KB L2 cache.

• The capacity refers only to the amount of information in the data array (values).

– It does not include the tag array (addresses), the dirty bits, and so on.


Lines and Line Lengths

• The basic unit of memory is a byte; the basic unit of cache is a line.

– Be careful not to use the word “block” in place of “line.” In cache, blocking means that upon a cache miss, one must write the new values to cache before proceeding.

• A line consists of many bytes (typically a power of 2, such as 32, 64, or 128). The number of bytes in a line is called the line length.
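The split between line and byte-within-line is just integer division by the line length. A quick sketch, using a 16-byte line (4 offset bits) so it matches the later FFA3/FFA example:

```python
# 16-byte lines -> 4 offset bits, as in the deck's FFA3 example
def split_address(addr):
    """Split a byte address into (line address, offset within the line)."""
    return addr >> 4, addr & 0xF

line, offset = split_address(0xFFA3)
print(hex(line), offset)   # 0xffa 3 -- search tag FFA, byte 3 of the line
```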


Memory:

Address | Value
FFA0    | 26
FFA1    | FD
FFA2    | A7
FFA3    | 37
…       | …

Cache:

Cache Address | Tag | Line Value
0011          | FFA | 26 FD A7 37 …

(byte offsets within the line: 0 1 2 3 …)


Example

• Assume a capacity of 512 KB.

• Don’t think of an array with 524,288 (512 K) elements with each element a byte long as you would if it were main memory.

• Instead think of an array with 16,384 (16 K) elements with each element 32 bytes long.
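The arithmetic behind that example, written out (32-byte lines, as above):

```python
CAPACITY = 512 * 1024   # 512 KB data array
LINE_LENGTH = 32        # 32-byte lines

num_lines = CAPACITY // LINE_LENGTH
print(num_lines)        # 16384 lines (16 K), not 524,288 byte-sized entries
```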

Line Length Benefits

• The concept of cache lines has a few benefits:

1. It directly builds in the notion of spatial locality – cache is physically designed to hold the contents of several consecutive memory locations.

2. Eventually we must perform a search on the tags to see if a particular memory address has been cached. The line length shortens the tag, i.e., the item one must search for.

– In the example on the earlier slide, one would search for FFA instead of FFA3. That is, the tag is four bits smaller than the address.


Line Length Benefits (Cont.)

3. The cached value must have been read from memory. Recall that one can significantly improve the efficiency of reading memory locations if they are consecutive locations (especially if they are all in the same row).

– So the paging/bursting improvements of reading memory are particularly important because of the way cache is structured.


Hardware Searching

• The cache is handed a memory address; it strips off the least significant bits to form the corresponding search tag; it then must search the tag array for that value.

– The most efficient search algorithm you know is useless at this level; we need to perform the search in a couple of clock cycles. We need to search using hardware.


Variations

• The hardware search can be executed in a number of ways, and this is where the terms direct-mapped, fully associative, and set-associative come in.

– The Pentium 4’s Advanced Transfer Cache has 8-way set associativity.

• The variations determine how many comparators (circuitry that determines whether we have a hit or a miss) are necessary.


XNOR: Bit Equality Comparator


ANDed XNORs: Word Equality Comparator
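The comparator logic on these two slides can be mimicked in software: XNOR each pair of bits, then AND all the XNOR outputs together. Two words are equal exactly when every XNOR output is 1. (The 16-bit width here is an assumption for illustration.)

```python
WIDTH = 16  # assumed tag width in bits

def bit_xnor(a, b, width=WIDTH):
    """Bitwise XNOR: each output bit is 1 iff the input bits match."""
    return ~(a ^ b) & ((1 << width) - 1)

def words_equal(a, b, width=WIDTH):
    """AND of all XNOR outputs: true only if every bit position matches."""
    return bit_xnor(a, b, width) == (1 << width) - 1

print(words_equal(0xFFA0, 0xFFA0))  # True  -> hit
print(words_equal(0xFFA0, 0xFF18))  # False -> miss
```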

Direct Mapping

• Direct mapping simplifies tag-array searching (i.e., minimizes the number of comparators) by saying that a given memory location can be cached in one and only one line of cache.

– The mapping is not one-to-one. Since memory is about a thousand times bigger than cache, many memory locations share a cache line, and only one section of memory can be cached there at a time.


Direct Mapping Cache

(Figure: memory locations mapping onto cache lines.)

A given memory location is mapped to one and only one line of cache. But each line of cache corresponds to several (sets of) memory locations. Only one of these can be cached at a given time.


A Direct Mapping Scenario

A memory address is divided into three fields:

– The low bits determine the position within the line of cache.

– The middle bits determine the cache address that will be used.

– The upper bits are the part of the address actually stored in the tag array.


A Direct Mapping Scenario (Cont.)

• A memory address is handed to the cache.

• The middle portion is used to select the cache address.

• The tag stored at that cache address and the upper portion of the original memory address are sent to a comparator.

– Note there’s only one comparator!

• If they are equal (a cache hit), then the lower portion of the original memory address is used to select the byte from within the line.
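The steps above can be sketched as a few lines of Python. The field widths (4 offset bits, 12 index bits) are assumptions for illustration, and dictionaries stand in for the tag and data arrays.

```python
OFFSET_BITS, INDEX_BITS = 4, 12   # assumed field widths

tag_array = {}    # cache address (index) -> stored tag
data_array = {}   # cache address (index) -> line of bytes

def lookup(addr):
    """Direct-mapped lookup: exactly one tag comparison per access."""
    offset = addr & ((1 << OFFSET_BITS) - 1)                 # position in line
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)  # cache address
    tag = addr >> (OFFSET_BITS + INDEX_BITS)                 # upper portion
    if tag_array.get(index) == tag:       # the single comparator
        return data_array[index][offset]  # hit: select the byte
    return None                           # miss: must go to memory

# Cache the line containing 0xFFA3, then look it up:
addr = 0xFFA3
idx = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
tag_array[idx] = addr >> (OFFSET_BITS + INDEX_BITS)
data_array[idx] = bytes(range(16))
print(lookup(0xFFA3))   # 3 (the byte at offset 3 of the cached line)
```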


A Potential Problem with Direct Mapping

• Recall that locality of reference (the notion behind caching) is particularly effective during repetitive tasks.

• Imagine that a loop involves two memory locations that share the same cache address (perhaps it processes a large array). Then each time the processor wanted one of the locations, the other would be in the cache. Thus, there would be two cache misses for each iteration of the loop. But loops are when caching is supposed to be at its most effective.

• TOO MANY CACHE MISSES!
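The pathological loop can be demonstrated directly. Two hypothetical addresses are chosen so their index bits match but their tags differ; in a direct-mapped cache they evict each other on every access (field widths assumed as before):

```python
OFFSET_BITS, INDEX_BITS = 4, 12   # assumed field widths

def index_of(addr):
    return (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)

a, b = 0x01230, 0x11230           # same index bits, different tags
assert index_of(a) == index_of(b)

cache = {}                         # index -> tag currently cached there
misses = 0
for _ in range(10):                # the loop touches both locations each pass
    for addr in (a, b):
        idx, tag = index_of(addr), addr >> (OFFSET_BITS + INDEX_BITS)
        if cache.get(idx) != tag:
            misses += 1
            cache[idx] = tag       # evict the other location's line
print(misses)                      # 20: every single access misses
```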


Fully Associative Cache: The Other Extreme

• In Direct Mapping, a given memory location is mapped onto one and only one cache location.

• In a fully associative cache, a given memory location can be mapped to any cache location.

– This solves the previous problem. There’s no conflict – one caches whatever is needed for the loop.

– But with a fully associative cache, searching becomes more difficult: one has to examine the entire tag array, whereas with direct mapping there was only one place to look.


Associativity = Many Comparators

• Looping through the tag array would be prohibitively slow. We must compare the memory address (or the appropriate portion thereof) to all of the values in the tag array simultaneously.

Array of Comparators

(Figure: the address is fed to one comparator per tag; the outputs indicate whether there is a hit – yes or no – and, if so, the address of the hit.)

For each element of the tag array, there is a comparator. Each comparator checks the tag element against the search tag.


Associative memory a.k.a. content addressable memory


Associative memory

• In regular memory, one provides an address, and then the value at that address is supplied.

• In associative memory (content addressable memory), one provides the value or some part thereof, and then the address and/or the remainder of the value is supplied.


The Problem with Fully Associative Cache

• All of those comparators are made of transistors. They take up room on the die. And any space lost to comparators has to be taken away from the data array.

– After all, we’re talking about thousands of comparators.

• ASSOCIATIVITY LOWERS CAPACITY!


Set-Associative Caches: The Compromise

• For example, instead of having the 1000-to-1 mapping we had with direct mapping, we could elect to have an 8000-to-8 mapping.

• That is, a given memory location can be cached into any of 8 cache locations, but the set of memory locations sharing those cache locations has also gone up by a factor of 8.

• This would be called an 8-way set associative cache.


A Happy Medium

• 4- or 8-way set associativity provides enough flexibility to allow one (under most circumstances) to cache the necessary memory locations to get the desired effects of caching for an iterative procedure.

– I.e., it minimizes cache misses.

• But it only requires 4 or 8 comparators instead of the thousands required for fully associative caches.


Bad Direct Mapping Scenario Recalled

• With direct mapping cache, the loop involves memory locations that share the same cache address. With set associative cache, the loop involves memory locations that share the same set of cache addresses.

• It is thus possible with a set-associative cache that each of these memory locations is cached to a different member of the set. The iterations can then proceed without repeated cache misses.


Set-Associative Cache

• Again the memory address is broken into three parts.

– One part determines the position in the line.

– One part determines, this time, a set of cache addresses.

– The last part is compared to what is stored in the tags of the set of cache locations.

– Etc.
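A sketch of an N-way set-associative lookup, with assumed field widths: the index selects one set, and only that set's N tags need comparing (N comparators, operating in parallel in hardware).

```python
WAYS, OFFSET_BITS, INDEX_BITS = 8, 4, 9   # assumed geometry

sets = [[] for _ in range(1 << INDEX_BITS)]   # each set holds up to WAYS tags

def access(addr):
    """Return True on a hit; on a miss, cache the tag (and maybe evict)."""
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    ways = sets[index]
    if tag in ways:        # at most WAYS comparisons
        return True
    if len(ways) == WAYS:
        ways.pop(0)        # a replacement policy stands in here
    ways.append(tag)
    return False

# Two addresses that conflicted under direct mapping now coexist in one set:
print(access(0x01230), access(0x11230))   # False False (cold misses)
print(access(0x01230), access(0x11230))   # True True (both cached)
```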


PCGuide.com comparison table

To which we add that full associativity has an adverse effect on capacity.


Cache Misses

• When a cache miss occurs, several factors have to be considered. For example,

– We want the new memory location written into the cache, but where?

– Can we continue attempting other cache interactions, or should we wait?

– What if the cached data has been modified?

– Should we do anything with the data we are taking out of the cache?


Replacement Policy

• Upon a cache miss, the memory that was not found in cache will be written to cache, but where?

– In direct mapping, there is no choice: it can only be written to the cache address it is mapped to.

– In associative and set-associative caches, there is a choice of what to replace.

Replacement Policy (Cont.)

• Least Recently Used (LRU)

– One approach is to track the use of what is in cache and to replace the line that has been used least recently.

– This is best in keeping with the locality-of-reference notion behind cache, but it requires a fair amount of overhead.

• Not-Most-Recently-Used

– Another approach is to choose a line at random, except that one protects the line (from the set) that has been used most recently.

– Less overhead.
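A minimal LRU sketch for a single set, using list order as recency (front = least recently used). This is a software illustration of the bookkeeping, not how the hardware tracks it:

```python
class LRUSet:
    """One set of an N-way cache with least-recently-used replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.tags = []             # front = least recently used

    def access(self, tag):
        """Return True on a hit; on a miss, evict the LRU tag if full."""
        if tag in self.tags:
            self.tags.remove(tag)
            self.tags.append(tag)  # move to most-recently-used position
            return True
        if len(self.tags) == self.ways:
            self.tags.pop(0)       # replace the least recently used line
        self.tags.append(tag)
        return False

s = LRUSet(ways=2)
s.access('A'); s.access('B')
s.access('A')            # touch A, so B becomes the LRU line
s.access('C')            # evicts B, not the recently used A
print(s.tags)            # ['A', 'C']
```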


Blocking or Non-Blocking Cache

• Replacement requires interacting with a slower type of memory (a lower level of cache or main memory). Do we allow the processor to continue to access cache during this procedure or not?

• This is the distinction between a blocking and a non-blocking cache.

– In a blocking cache, all cache transactions must wait until the cache has been updated.

– In a non-blocking cache, other cache transactions are possible.

Cache Write Policy

• The data cache may not only be read but also written to. But cache is just standing in as a handy representative for main memory. That’s really where one wants to write, and writing main memory is relatively slow, just as reading it is.

• The rules about when one does this writing to memory are called one’s write policy.

• One reason for separating the data cache and the instruction cache (Harvard cache) is that the instruction cache does not require a write policy.

Write-Back Cache

• Because writing to memory is slow, in a write-back cache, a.k.a. “copy back” cache, one waits until the line of cache is being replaced to write any values back to memory.

– Main memory and cache are inconsistent, but the cache value will always be used. In such a case the memory is said to be “stale.”


Dirty Bit

• Since writing back to main memory is slow, one only wants to do it if necessary, that is, if some part of the line has been updated.

• Each line of cache has a “dirty bit” which tells the cache controller whether or not the line has been updated since it was last replaced.

– Only if the dirty bit is flipped does one need to write back.
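A minimal sketch of write-back with a dirty bit: writes touch only the cache and set the bit; memory is updated once, at replacement time, and only for dirty lines. (Class and field names here are illustrative, not from the slides.)

```python
class WriteBackLine:
    """One cache line under a write-back policy."""

    def __init__(self, tag, data):
        self.tag, self.data, self.dirty = tag, data, False

    def write(self, offset, byte):
        self.data[offset] = byte
        self.dirty = True            # flip the dirty bit; memory is now stale

    def replace(self, memory):
        """Write back only if the line was modified since it was loaded."""
        if self.dirty:
            memory[self.tag] = list(self.data)   # write the entire line
        # a clean line is simply dropped, with no memory traffic

memory = {}
line = WriteBackLine(tag=0xFFA, data=[0] * 16)
line.write(3, 0x37)
line.replace(memory)
print(hex(memory[0xFFA][3]))   # 0x37: written back once, at replacement
```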


Pros and Cons of Write Back

• Pro: Write Back takes advantage of the locality of reference concept. If the line of cache is written to, it’s likely to be written to again soon (before it is replaced).

• Con: When one writes back to main memory, one must write the entire line.

Write-Through Cache

• With a write-through cache, one writes the value back to memory every time a cache line is updated.

– Con: Effectively, a write-through cache is being used as a cache (a fast stand-in for memory) only for purposes of reading and not for writing.

– Pro: When one writes, one is only writing a byte instead of a line. (That’s not much of an advantage, given the efficiency of burst/page reading/writing when the cache interacts with memory.)

– Pro: Integrity – cache and memory always agree.


Comparing Policies

• Write back is more efficient.

• Write through maintains integrity.

• Integrity is not so much an issue at the SRAM-DRAM interface in the memory hierarchy, since both are volatile.

– This issue is more important at the next lower interface (main memory/virtual memory), as virtual memory is non-volatile.


Victim Cache

• Other than writing modified data back to memory, what do we do with the data that is being replaced?

• One answer is nothing.

• Another possibility is to store it in a buffer that is faster than the next lower level, effectively introducing another small level of cache. This is known as the victim cache or victim buffer.


References

• Computer Architecture, Nicholas Carter

• http://www.pcguide.com/ref/mbsys/cache/

• http://www.howstuffworks.com/cache.htm/printable

• http://slcentral.com/articles/00/10/cache/print.php