Lec 6.2
The Big Picture: Where are We Now?
The Five Classic Components of a Computer:
• Control
• Datapath
• Memory
• Input
• Output
The Control and the Datapath together form the Processor.
Lec 6.3
The Art of Memory System Design
Processor <-> Cache (SRAM) <-> Main Memory (DRAM)
The processor issues a memory reference stream:
  <op,addr>, <op,addr>, <op,addr>, <op,addr>, . . .
  op: i-fetch, read, write
Goal: optimize the memory system organization to minimize the average memory access time for typical workloads (workload or benchmark programs).
Lec 6.4
Technology Trends
DRAM generations:
Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns
Capacity improved 1000:1 over this period; cycle time improved only about 2:1!
Lec 6.5
Processor-DRAM Memory Gap
[Figure: relative performance (log scale, 1 to 1000) vs. year (1980-2000).
µProc: 60%/yr. (2X/1.5 yr), labeled "Moore's Law";
DRAM: 9%/yr. (2X/10 yrs), labeled "Less' Law?"]
The processor-memory performance gap grows 50% / year.
Lec 6.6
The Goal: illusion of large, fast, cheap memory
Facts
• Large memories are slow but cheap (DRAM)
• Fast memories are small yet expensive (SRAM)
How do we create a memory that is large, fast and cheap?
• Memory hierarchy
• Parallelism
Lec 6.7
The Principle of Locality
The principle of locality: Programs access a relatively small
portion of their address space at any instant of time
Temporal Locality (Locality in Time)
=> If an item is referenced, it will tend to be referenced again soon
=> Keep most recently accessed data items closer to the processor
Spatial Locality (Locality in Space)
=> If an item is referenced, nearby items will tend to be referenced soon
=> Move blocks of contiguous words to the upper levels
Q: Why does code have locality?
Lec 6.8
Memory Hierarchy
Based on the principle of locality: a way of providing large, cheap, and fast memory.
Moving away from the processor (Control + Datapath):
Registers                        Speed: 1s ns                           Size: 100s bytes
On-chip / second-level cache (SRAM)   Speed: 10s ns                     Size: Ks
Main memory (DRAM)               Speed: 100s ns                         Size: Ms
Secondary storage (disk)         Speed: 10,000,000s ns (10s ms)         Size: Gs
Tertiary storage (tape)          Speed: 10,000,000,000s ns (10s sec)    Size: Ts
$ per Mbyte increases toward the processor.
Lec 6.9
Cache Memory
[Figure: the CPU exchanges words with the cache; the cache exchanges blocks with main memory. The cache holds C lines (0 to C-1), each with a tag and a block of K words; main memory holds 2^n words, grouped into blocks of K words.]
Lec 6.10
Elements of Cache Design
Cache size
Mapping function
• Direct
• Set associative
• Fully associative
Replacement algorithm
• Least recently used (LRU)
• First in first out (FIFO)
• Random
Write policy
• Write through
• Write back
Line size
Number of caches
• Single or two level
• Unified or split
Lec 6.11
Terminology
Hit: data appears in some block in the upper level
• Hit Rate: the fraction of memory accesses found in the upper level
• Hit Time: time to access the upper level, which consists of
  RAM access time + time to determine hit/miss
[Figure: read hit. (1) The processor requests a word; (2) the cache (upper level), already holding X1, X2, ..., Xn-1, returns it without accessing memory (lower level).]
Lec 6.12
Terminology
Miss: data needs to be retrieved from a block in the lower level
• Miss Rate = 1 - (Hit Rate)
• Miss Penalty: time to replace a block in the upper level +
  time to deliver the block to the processor
Hit Time << Miss Penalty
[Figure: read miss. (1) The processor requests Xn; (2) Xn is not in the cache; (3) the block containing Xn is fetched from memory (lower level); (4) the block is placed in the cache (upper level) and Xn is delivered to the processor.]
Lec 6.13
Direct Mapped Cache
Each memory location is mapped to exactly one location in the cache:
Cache block # = (Block address) modulo (# of cache blocks)
             = low-order log2(# of cache blocks) bits of the address
[Figure: an 8-block cache (indices 000-111); memory addresses 00001, 01001, 10001, 11001 all map to cache block 001, and 00101, 01101, 10101, 11101 all map to cache block 101.]
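The mapping rule above can be checked with a short Python sketch (a hypothetical helper, assuming a power-of-two number of cache blocks):

```python
# Direct-mapped placement: (block address) modulo (# of cache blocks).
# For a power-of-two block count this keeps only the low-order
# log2(# of cache blocks) bits of the address.
def cache_block(block_address, num_cache_blocks):
    return block_address % num_cache_blocks

# The 8-block example: the low-order 3 bits select the cache block.
print(format(cache_block(0b00001, 8), "03b"))   # 001
print(format(cache_block(0b11101, 8), "03b"))   # 101
```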
Lec 6.14
64 KByte Direct Mapped Cache
• Why do we need a Tag field?
• Why do we need a Valid bit field?
• What kind of locality are we taking care of?
• Total number of bits in a cache:
  2^n x (|valid| + |tag| + |block|)
  2^n: # of cache blocks
  |valid| = 1 bit
  |tag| = 32 - (n + 2), for a 32-bit byte address and 1-word blocks
  |block| = 32 bits
[Figure: the 32-bit address is split into a 16-bit tag (bits 31-16), a 14-bit index (bits 15-2), and a 2-bit byte offset (bits 1-0). The cache has 16K entries, each with a valid bit, a 16-bit tag, and 32 bits of data; a hit is signaled when the indexed entry is valid and its tag matches.]
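A quick numerical check of the bit-count formula for this cache (a sketch; the 16K-entry, 1-word-block geometry is taken from the slide above):

```python
# Total bits = 2^n x (|valid| + |tag| + |block|) for the 64 KByte cache.
n = 14                 # 2^14 = 16K cache blocks
valid = 1              # one valid bit per entry
tag = 32 - (n + 2)     # 16 bits: 32-bit byte address, 2-bit byte offset
block = 32             # one 32-bit word of data
total_bits = 2**n * (valid + tag + block)
print(total_bits)      # 802816 bits, i.e. 98 KBytes of SRAM for 64 KBytes of data
```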
Lec 6.15
Reading from Cache
Address the cache by the PC (instruction fetch) or the ALU output (data access)
If the cache signals hit, we have a read hit
• The requested word will be on the data lines
Otherwise, we have a read miss
• Stall the CPU
• Fetch the block from memory and write it into the cache
• Restart the execution
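The read-hit/read-miss flow can be sketched as a tiny direct-mapped cache model in Python (a hypothetical class, not the slides' hardware; `fetch_from_memory` stands in for the stall-and-fetch step):

```python
# Minimal direct-mapped read path: index one entry, check valid bit and tag.
class DirectMappedCache:
    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.valid = [False] * num_blocks
        self.tag = [None] * num_blocks
        self.data = [None] * num_blocks

    def read(self, block_address, fetch_from_memory):
        index = block_address % self.num_blocks   # which cache block
        tag = block_address // self.num_blocks    # which memory block it holds
        if self.valid[index] and self.tag[index] == tag:
            return self.data[index], True         # read hit
        # Read miss: (in hardware, stall the CPU,) fetch the block from
        # memory, write it into the cache, then restart.
        self.valid[index] = True
        self.tag[index] = tag
        self.data[index] = fetch_from_memory(block_address)
        return self.data[index], False

memory = {12: "block-12"}
cache = DirectMappedCache(8)
print(cache.read(12, memory.get))   # ('block-12', False)  - miss
print(cache.read(12, memory.get))   # ('block-12', True)   - hit
```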
Lec 6.16
Writing to Cache
Address the cache by the PC or the ALU output
If the cache signals hit, we have a write hit
• We have two options:
  - Write-through: write the data into both the cache and memory
  - Write-back: write the data only into the cache, and
    write it to memory only when the block is replaced
Otherwise, we have a write miss
• Handle the write miss as if it were a write hit
Lec 6.17
Taking advantage of spatial locality
Address (showing bit positions)
16 12 Byteoffset
V Tag Data
Hit Data
16 32
4Kentries
16 bits 128 bits
Mux
32 32 32
2
32
Block offsetIndex
Tag
31 16 15 4 32 1 0
64 KByte Direct Mapped Cache
Lec 6.18
Writing to Cache
Address the cache by the PC or the ALU output
If the cache signals hit, we have a write hit
• Write-through cache: write the data into both the cache and memory
Otherwise, we have a write miss
• Stall the CPU
• Fetch the block from memory and write it into the cache
• Restart the execution and rewrite the word
Lec 6.19
Associativity in Caches
[Figure: four organizations of an 8-block cache.
One-way set associative (direct mapped): 8 blocks (0-7), each with one tag/data entry.
Two-way set associative: 4 sets (0-3), each with 2 tag/data entries.
Four-way set associative: 2 sets (0-1), each with 4 tag/data entries.
Eight-way set associative (fully associative): 1 set with 8 tag/data entries.]
Compute the set number:
(Block number) modulo (Number of sets)
Choose one of the blocks in the computed set
Lec 6.20
Set Asscociative Cache
Address
22 8
V TagIndex012
253254255
Data V Tag Data V Tag Data V Tag Data
3222
4-to-1 multiplexor
Hit Data
123891011123031 0
N-way set associative• N direct mapped caches operates in parallel
• N entries for each cache index
• N comparators and a N-to-1 mux
• Data comes AFTER Hit/Miss decision and set selection
A four-way set associative cache
Lec 6.21
Fully Associative Cache
A block can be anywhere in the cache => no cache index
Compare the cache tags of all cache entries in parallel
Practical only for a small number of cache blocks
[Figure: each entry has a valid bit, a 27-bit cache tag, and a 32-byte block (Byte 0 ... Byte 31). The address supplies a cache tag (bits 31-5), compared against every entry's tag in parallel, and a byte select (bits 4-0, e.g. 0x01).]
Lec 6.22
Four Questions for Caches
Q1: Block placement?
  Where can a block be placed in the upper level?
Q2: Block identification?
  How is a block found if it is in the upper level?
Q3: Block replacement?
  Which block should be replaced on a miss?
Q4: Write strategy?
  What happens on a write?
Lec 6.23
Q1: Block Placement?
Block 12 to be placed in an 8-block cache:
[Figure: fully associative: block 12 can go in any of blocks 0-7. Direct mapped: (12 mod 8) = 4, only block 4. Set associative (4 sets): (12 mod 4) = 0, any block in set 0.]
Direct mapped: one place - (Block address) mod (# of cache blocks)
Set associative: a few places - (Block address) mod (# of cache sets)
  # of cache sets = # of cache blocks / degree of associativity
Fully associative: any place
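The three placement rules can be sketched in one helper (hypothetical code; it assumes the blocks of set s are the contiguous blocks s*assoc through s*assoc + assoc - 1, as in the figure):

```python
# Which cache blocks may hold a given memory block, for any associativity?
def candidate_blocks(block_address, num_blocks, associativity):
    num_sets = num_blocks // associativity      # blocks / degree of assoc.
    s = block_address % num_sets                # (block) mod (# of sets)
    return list(range(s * associativity, (s + 1) * associativity))

# Block 12 in an 8-block cache:
print(candidate_blocks(12, 8, 1))   # direct mapped: [4]
print(candidate_blocks(12, 8, 2))   # two-way, 4 sets: set 0 -> [0, 1]
print(candidate_blocks(12, 8, 8))   # fully associative: all 8 blocks
```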
Lec 6.24
Q2: Block Identification?
[Figure: the address is split into the block address (Tag + Index) and the block offset; the index selects the set ("set select"), and the tag is compared to identify the block ("data select").]
Direct mapped: indexing - index, 1 comparison
N-way set associative: limited search - index the set, N comparisons
Fully associative: full search - search all cache entries
Lec 6.25
Easy for Direct Mapped
Set Associative or Fully Associative:• Random: Randomly select one of the blocks in the set
• LRU (Least Recently Used): Select the block in the set which has been
unused for the longest time
Associativity: 2-way 4-way 8-way
Size LRU Random LRU Random LRU Random
16 KB 5.2% 5.7% 4.7% 5.3% 4.4% 5.0%
64 KB 1.9% 2.0% 1.5% 1.7% 1.4% 1.5%
256 KB 1.15% 1.17% 1.13% 1.13% 1.12% 1.12%
Q3: Replacement Policy on a Miss?
Lec 6.26
Write through— The information is written to both the block in the cache and to the block in the lower-level memory
Write back— The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced
• is block clean or dirty?
Pros and Cons of each?• WT: read misses cannot result in writes
• WB: no writes of repeated writes
WT always combined with write buffers to avoid
waiting for lower level memory
Q4: Write Policy?
Lec 6.27
Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Cycle time
Note: memory hit time is included in execution cycles
Stalls due to cache misses:
Memory stall clock cycles = Read-stall clock cycles + Write-stall clock cycles
Read-stall clock cycles = Reads x Read miss rate x Read miss penalty
Write-stall clock cycles = Writes x Write miss rate x Write miss penalty
If read miss penalty = write miss penalty:
Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
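A sketch of the stall formulas with hypothetical counts (800 reads and 200 writes, both with a 5% miss rate and a 40-cycle penalty):

```python
# Memory stall cycles = read stalls + write stalls.
def memory_stall_cycles(reads, read_miss_rate, read_penalty,
                        writes, write_miss_rate, write_penalty):
    read_stalls = reads * read_miss_rate * read_penalty
    write_stalls = writes * write_miss_rate * write_penalty
    return read_stalls + write_stalls

# Equal penalties collapse to: memory accesses x miss rate x miss penalty.
print(memory_stall_cycles(800, 0.05, 40, 200, 0.05, 40))   # 2000.0
```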
Lec 6.28
Cache PerformanceCPU time = Instruction count x CPI x Cycle time
= Inst count x Cycle time x
(ideal CPI + Memory stalls/Inst + Other stalls/Inst)
Memory Stalls/Inst =
Instruction Miss Rate x Instruction Miss Penalty +
Loads/Inst x Load Miss Rate x Load Miss Penalty +
Stores/Inst x Store Miss Rate x Store Miss Penalty
Average Memory Access time (AMAT) =
Hit Time + (Miss Rate x Miss Penalty) =
(Hit Rate x Hit Time) + (Miss Rate x Miss Time)
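The two forms of the AMAT identity can be checked numerically (hypothetical numbers, chosen to be exact in binary floating point: 1-cycle hit time, 12.5% miss rate, 40-cycle miss penalty; miss time = hit time + miss penalty):

```python
hit_time, miss_rate, miss_penalty = 1, 0.125, 40
# Form 1: hit time + miss rate x miss penalty.
form1 = hit_time + miss_rate * miss_penalty
# Form 2: hit rate x hit time + miss rate x miss time.
form2 = (1 - miss_rate) * hit_time + miss_rate * (hit_time + miss_penalty)
print(form1, form2)   # 6.0 6.0
```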
Lec 6.29
Example Suppose a processor executes at
• Clock Rate = 200 MHz (5 ns per cycle)• Base CPI = 1.1 • 50% arith/logic, 30% ld/st, 20% control
Suppose that 10% of memory operations get 50 cycle miss penalty
Suppose that 1% of instructions get same miss penalty
CPI = Base CPI + average stalls per instruction = 1.1(cycles/ins) +
[ 0.30 (Data Mops/ins) x 0.10 (miss/Data Mop) x 50 (cycle/miss)] +
[ 1 (Inst Mop/ins) x 0.01 (miss/Inst Mop) x 50 (cycle/miss)]
= (1.1 + 1.5 + .5) cycle/ins = 3.1
AMAT= (1/1.3)x[1+0.01x50]+ (0.3/1.3)x[1+0.1x50]= 2.54
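The example's arithmetic, reproduced in Python as a check:

```python
base_cpi = 1.1
ld_st_per_inst = 0.30     # 30% of instructions are loads/stores
data_miss_rate = 0.10     # 10% of data memory operations miss
inst_miss_rate = 0.01     # 1% of instruction fetches miss
penalty = 50              # cycles per miss

cpi = base_cpi + ld_st_per_inst * data_miss_rate * penalty \
               + 1.0 * inst_miss_rate * penalty
print(round(cpi, 2))      # 3.1

# AMAT averaged over the 1.3 memory accesses per instruction
# (1 instruction fetch + 0.3 data accesses):
amat = (1 / 1.3) * (1 + inst_miss_rate * penalty) \
     + (0.3 / 1.3) * (1 + data_miss_rate * penalty)
print(round(amat, 2))     # 2.54
```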
Lec 6.30
Options to reduce AMAT:
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache
CPU Time = IC x CT x (ideal CPI + memory stalls)
Average Memory Access time = Hit Time + (Miss Rate x Miss Penalty)
= (Hit Rate x Hit Time) + (Miss Rate x Miss Time)
Improving Cache Performance
Lec 6.31
Reduce Misses: Larger Block Size
[Figure: miss rate (0%-25%) vs. block size (16, 32, 64, 128, 256 bytes) for cache sizes of 1K, 4K, 16K, 64K, and 256K. Miss rate generally drops as block size grows, but can rise again for very large blocks in small caches.]
Increasing block size also increases the miss penalty!
Lec 6.32
Reduce Misses: Higher Associativity
0%
3%
6%
9%
12%
15%
Eight-wayFour-wayTwo-wayOne-way
1 KB
2 KB
4 KB
8 KB
Mis
s ra
te
Associativity 16 KB
32 KB
64 KB
128 KB
Increasing associativity also increases both time and hardware cost !
Lec 6.33
Reducing Penalty: Second-Level Cache
Proc <-> L1 Cache <-> L2 Cache
L2 equations:
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
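Plugging hypothetical timings into the L2 equations (1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 25% L2 local miss rate, 100-cycle memory penalty):

```python
# Two-level AMAT: the L1 miss penalty is itself an AMAT into L2.
def amat_two_level(hit_l1, mr_l1, hit_l2, mr_l2, mem_penalty):
    miss_penalty_l1 = hit_l2 + mr_l2 * mem_penalty   # L2 hit + L2 misses
    return hit_l1 + mr_l1 * miss_penalty_l1

print(round(amat_two_level(1, 0.05, 10, 0.25, 100), 2))   # 2.75
```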
Lec 6.34
Simple: • CPU, Cache, Bus,
Memory same width (32 bits)
Interleaved: • CPU, Cache, Bus- 1 word
• N Memory Modules
Wide: • CPU/Mux 1 word;
Mux/Cache, Bus, Memory N words
Designing the Memory System to Support Caches
Lec 6.35
DRAM (Read/Write) Cycle Time >>
DRAM (Read/Write) Access Time
DRAM (Read/Write) Cycle Time :• How frequent can you initiate an access?
DRAM (Read/Write) Access Time:• How quickly will you get what you want once you initiate an
access?
DRAM Bandwidth Limitation
TimeAccess Time
Cycle Time
Main Memory Performance
Lec 6.36
Access Pattern without Interleaving: CPU Memory
Start Access for D1 Start Access for D2
D1 available
Access Pattern with 4-way Interleaving:
Acc
ess
Bank
0
Access Bank 1
Access Bank 2
Access Bank 3
We can Access Bank 0 again
CPU
MemoryBank 1
MemoryBank 0
MemoryBank 3
MemoryBank 2
Increasing Bandwidth - Interleaving
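With low-order interleaving (a common scheme, assumed here), consecutive word addresses fall in consecutive banks, so sequential accesses keep all banks busy:

```python
# Sketch: map a word address to one of 4 banks (low-order interleaving).
def bank_of(word_address, num_banks=4):
    return word_address % num_banks

# Sequential words rotate through the banks, so bank 0 has finished its
# cycle time by the time it is needed again.
print([bank_of(w) for w in range(8)])   # [0, 1, 2, 3, 0, 1, 2, 3]
```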
Lec 6.37
Summary #1/2 The Principle of Locality:
• Program likely to access a relatively small portion of the address space at any instant of time.
- Temporal Locality: Locality in Time
- Spatial Locality: Locality in Space
Three (+1) Major Categories of Cache Misses:• Compulsory Misses: sad facts of life. Example: cold start misses.
• Conflict Misses: increase cache size and/or associativity.Nightmare Scenario: ping pong effect!
• Capacity Misses: increase cache size
Cache Design Space• total size, block size, associativity
• replacement policy
• write-hit policy (write-through, write-back)
• write-miss policy
Lec 6.38
Summary #2/2: The Cache Design Space Several interacting dimensions
• cache size
• block size
• associativity
• replacement policy
• write-through vs write-back
• write allocation
The optimal choice is a compromise• depends on access characteristics
- workload
- use (I-cache, D-cache, TLB)
• depends on technology / cost
Simplicity often wins
Associativity
Cache Size
Block Size
Bad
Good
Less More
Factor A Factor B