Lecture 17 Slide 1 EECS 470
EECS 470
Lecture 17
Prefetching
Winter 2021
Jon Beaumont
http://www.eecs.umich.edu/courses/eecs470
Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lee, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Georgia Tech, Purdue University, University of Michigan, and University of Wisconsin.
Lecture 17 Slide 2 EECS 470
Administrative
HW #4 due Friday (4/2)
Let me know if there are any issues with the other HW on Gradescope
Milestone III next week
• No submissions needed
• Aim to have simple programs (including memory ops) running correctly
• The remaining couple of weeks should focus on testing and optimizing
Lecture 17 Slide 3 EECS 470
Last time
Cache techniques to reduce cache misses and miss penalties
Lecture 17 Slide 4 EECS 470
Today
Finish up cache enhancements
Reduce number of cache misses through prefetching
Lecture 17 Slide 5 EECS 470
Large Blocks
Pros of large cache blocks:
+ smaller tag overhead
+ take advantage of spatial locality
Cons:
- take longer to fill
- wasted bandwidth if block size is larger than spatial locality
Poll: What are the advantages of large cache blocks?
Lecture 17 Slide 6 EECS 470
Large Blocks and Subblocking
Can get the best of both worlds.
Large cache blocks can take a long time to refill:
- refill the cache line critical-word-first
- restart the cache access before the refill completes
Large cache blocks can waste bus bandwidth if block size is larger than spatial locality:
- divide a block into subblocks
- associate a separate valid bit with each subblock
- only load a subblock on access, while still keeping the reduced tag overhead

[Block layout: tag | v subblock | v subblock | v subblock]
Lecture 17 Slide 7 EECS 470
Multi-Level Caches
Processors are getting faster w.r.t. main memory:
- larger caches reduce the frequency of the more costly misses
- but larger caches are too slow for the processor
=> gradually reduce the miss cost with multiple cache levels

tavg = thit + miss ratio x tmiss
Lecture 17 Slide 8 EECS 470
Multi-Level Cache Design
[Figure: Proc connects to split L1I/L1D caches, backed by a unified L2]

Each level uses a different technology and has different requirements, so each makes a different choice of capacity, block size, and associativity.

tavg-L1 = thit-L1 + miss-ratioL1 x tavg-L2
tavg-L2 = thit-L2 + miss-ratioL2 x tmemory

What is the miss ratio?
- global: L2 misses / L1 accesses
- local: L2 misses / L1 misses
Lecture 17 Slide 9 EECS 470
The Inclusion Property
Inclusion means L2 is a superset of L1 (ditto for L3...).
Why?
- an addr in L1 is frequently used, so it would be kept in L2 anyway
- makes L1 writebacks simpler
- L2 can handle external coherence checks without involving L1
Inclusion takes effort to maintain:
- L2 must track what is cached in L1
- on an L2 replacement, the corresponding blocks must be flushed from L1
How can a violation happen? Consider:
1. L1 block size < L2 block size
2. different associativity in L1
3. L1 filters the L2 access sequence, which affects L2's LRU replacement order
Lecture 17 Slide 10 EECS 470
Possible Inclusion Violation
Setup: 2-way set-associative L1 holding a and b; direct-mapped L2.
- a, b, c have the same L1 index bits
- b, c have the same L2 index bits
- a and {b, c} have different L2 index bits
Step 1: L1 miss on c
Step 2: a is displaced from L1 to L2
Step 3: in L2, b is replaced by c
Now b is still in L1 but no longer in L2, violating inclusion.
Lecture 17 Slide 11 EECS 470
Non-blocking Caches
Also known as lockup-free caches.
Instead of stalling pending accesses to the cache on a miss, keep track of misses in special registers (Miss Status Holding Registers, MSHRs) and keep handling new requests.
Key implementation problems:
- handle reads to a pending miss
- handle writes to a pending miss
- keep multiple requests straight

Example access stream: ld A (hit), ld B (miss), ld C (miss), ld D (hit), st B (miss, pending). The MSHRs track the outstanding misses to B and C.
Lecture 17 Slide 12 EECS 470
EECS 470 Roadmap
[Roadmap: speed up programs by parallelizing, reducing instruction latency, and reducing the number of instructions; instruction-level parallelism covers instruction flow, while caching and prefetching reduce average memory latency (memory flow). Today: prefetching.]
Lecture 17 Slide 13 EECS 470
The memory wall
Today: 1 mem access ≈ 500 arithmetic ops
How do we reduce memory stalls for existing SW?

[Figure: log-scale performance vs. year, 1985-2010: processor performance pulls steadily away from memory performance. Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 4th ed.]
Lecture 17 Slide 14 EECS 470
Conventional approach #1: Avoid main memory accesses
Cache hierarchies:
- trade off capacity for speed
Add more cache levels?
- diminishing locality returns
- no help for shared data in MPs

[Figure: CPU -> 64K L1 (2 clk) -> 4M L2 (20 clk) -> main memory (200 clk), with write data flowing back from the CPU]
Lecture 17 Slide 15 EECS 470
Conventional approach #2: Hide memory latency
Out-of-order execution:
- overlap compute & mem stalls
Expand the OoO instruction window?
- issue & load-store logic are hard to scale
- no help for dependent instructions

[Figure: execution timelines: in-order serializes compute and mem stalls, while OoO overlaps them]
Lecture 17 Slide 16 EECS 470
What is Prefetching?
• Fetch memory before it's needed
• Targets compulsory, capacity, & coherence misses
Big challenges:
1. Knowing “what” to fetch
• Fetching useless info wastes valuable resources
2. Knowing “when” to fetch it
• Fetching too early clutters storage
• Fetching too late defeats the purpose of “pre”-fetching
Lecture 17 Slide 17 EECS 470
Software Prefetching
The compiler or programmer places prefetch instructions; requires ISA support.
Why not use regular loads?
Found in recent ISAs such as SPARC V9.
Prefetch into:
- a register (binding)
- caches (non-binding)
Lecture 17 Slide 18 EECS 470
Software Prefetching (Cont.)
e.g.,
for (I = 1; I < rows; I++)
  for (J = 1; J < columns; J++)
  {
    prefetch(&x[I+1][J]);   /* non-binding prefetch of the next row */
    sum = sum + x[I][J];
  }
Lecture 17 Slide 19 EECS 470
Hardware Prefetching
What to prefetch?
- one block spatially ahead?
- use address predictors: works for regular patterns (x, x+8, x+16, ...)
When to prefetch?
- on every reference
- on every miss
- when previously prefetched data is referenced
Where to put prefetched data?
- auxiliary buffers
- caches
Poll: Which cache is probably easier to design a prefetcher for?
Poll: We've already seen one implicit form of prefetching. When?
Lecture 17 Slide 20 EECS 470
Spatial Locality and Sequential Prefetching
Sequential prefetching: just grab the next few lines from memory.
Works well for the I-cache:
- instruction fetch tends to access memory sequentially
Doesn't work as well for the D-cache:
- more irregular access patterns
- regular patterns may have non-unit stride (e.g., matrix code)
Relatively easy to implement:
- a large cache block size already has the effect of prefetching
- after loading one cache line, start loading the next line automatically if it is not in the cache and the bus is not busy
- if we know the typical basic block size (i.e., the avg distance between branches), we can fetch the next several lines
Lecture 17 Slide 21 EECS 470
Stride Prefetchers
The access pattern of a particular static load is more predictable.
Reference Prediction Table (RPT):
- remembers previously executed loads: their PC, the last address referenced, and the stride between the last two references
- when executing a load, look it up in the RPT and compute the distance between the current data addr and the last addr
- if the new distance matches the old stride, we have found a pattern: go ahead and prefetch "current addr + stride"
- update "last addr" and "last stride" for the next lookup

RPT entry (indexed by load inst PC):
Load Inst PC (tag) | Last Address Referenced | Last Stride | Flags
Lecture 17 Slide 22 EECS 470
Stream Buffers [Jouppi]
Each stream buffer holds one stream of sequentially prefetched cache lines.
On a load miss, check the head of all stream buffers for an address match:
- if hit, pop the entry from the FIFO and update the cache with the data
- if not, allocate a new stream buffer to the new miss address (may have to recycle a stream buffer following an LRU policy)
Stream buffer FIFOs are continuously topped off with subsequent cache lines whenever there is room and the bus is not busy.
Stream buffers can incorporate stride prediction mechanisms to support non-unit-stride streams.
No cache pollution: prefetched lines wait in the buffers rather than evicting cache blocks.

[Figure: several stream-buffer FIFOs sitting between the DCache and the memory interface]
Lecture 17 Slide 23 EECS 470
Generalized Access Pattern Prefetchers
How do you prefetch
1. Heap data structures?
2. Indirect array accesses?
3. Generalized memory access patterns?
Current proposals:
• Precomputation prefetchers (runahead execution)
• Address correlating prefetchers (temporal memory streaming)
• Spatial pattern prefetchers (spatial memory streaming)
Lecture 17 Slide 24 EECS 470
Runahead Prefetchers
First proposed for I/O prefetching (Gibson et al.)
Duplicate the program:
• Only execute the address-generating stream
• Let it run ahead
May run as a thread on:
• A separate processor
• The same multithreaded processor
Or on custom address generation logic
Many names: slipstream, precomputation, runahead, ...

[Figure: main thread alongside a prefetch thread running ahead of it]
Lecture 17 Slide 25 EECS 470
Runahead Prefetcher
To get ahead:
• Must avoid waiting
• Must compute less
Predict
1. Control flow thru branch prediction
2. Data flow thru value prediction
3. Address generation computation only
+ Prefetch any pattern (need not be repetitive)
- Prediction only as good as branch + value prediction
How much prefetch lookahead?
Lecture 17 Slide 26 EECS 470
Correlation-Based Prefetching
Consider the following history of load addresses emitted by a processor:
A, B, C, D, C, E, A, C, F, F, E, A, A, B, C, D, E, A, B, C, D, C
After referencing a particular address (say A or E), are some addresses more likely to be referenced next?

[Figure: Markov model over addresses A-F, with transition probabilities learned from the history, e.g. A -> B with probability .6 and A -> C with probability .2]
Lecture 17 Slide 27 EECS 470
Correlation-Based Prefetching (Cont.)
Track the likely next addresses after seeing a particular addr.
Prefetch accuracy is generally low, so prefetch up to N next addresses to increase coverage (but this wastes bandwidth).
Prefetch accuracy can be improved by using a longer history:
- decide which address to prefetch next by looking at the last K load addresses instead of just the current one
- e.g., index with the XOR of the data addresses from the last K loads
- using a history of even a couple of loads can increase accuracy dramatically
This technique can also be applied to just the load miss stream.

Table entry (indexed by load data addr):
Load Data Addr (tag) | Prefetch Candidate 1 | Confidence | ... | Prefetch Candidate N | Confidence
Lecture 17 Slide 28 EECS 470
More info on Prefetching?
Professor Wenisch (a professor here, currently working at Google) wrote a great summary of the state of the art, available through umich IP addresses:
https://www.morganclaypool.com/doi/abs/10.2200/S00581ED1V01Y201405CAC028
Lecture 17 Slide 29 EECS 470
Improving Cache Performance: Summary
Miss rate:
- large block size
- higher associativity
- victim caches
- skewed-/pseudo-associativity
- hardware/software prefetching
- compiler optimizations
Miss penalty:
- give priority to read misses over writes/writebacks
- subblock placement
- early restart and critical word first
- non-blocking caches
- multi-level caches
Hit time (difficult?):
- small and simple caches
- avoiding translation during L1 indexing (later)
- pipelining writes for fast write hits
- subblock placement for fast write hits in write-through caches
Lecture 17 Slide 30 EECS 470
Next Time
Multicore!
Lingering questions / feedback? I'll include an anonymous form at the end of every lecture: https://bit.ly/3oSr5FD