Lecture 17 Slide 1 EECS 470
EECS 470
Lecture 17
Prefetching
Winter 2021
Jon Beaumont
http://www.eecs.umich.edu/courses/eecs470
Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lee, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Georgia Tech, Purdue University, University of Michigan, and University of Wisconsin.
Lecture 17 Slide 2 EECS 470
Administrative
HW #4 due Friday (4/2)
Let me know if there are any issues with the other HW on Gradescope
Milestone III next week
• No submissions needed
• Aim to have simple programs (including memory ops) running correctly
• The remaining couple of weeks should focus on testing and optimizing
Lecture 17 Slide 3 EECS 470
Last time
Cache techniques to reduce cache misses and miss penalties
Lecture 17 Slide 4 EECS 470
Today
Finish up cache enhancements
Reduce number of cache misses through prefetching
Lecture 17 Slide 5 EECS 470
Large Blocks
Pros of large cache blocks:
+ smaller tag overhead
+ take advantage of spatial locality
Cons:
- take longer to fill
- wasted bandwidth if block size is larger than spatial locality
Poll: What are the advantages of large cache blocks?
Lecture 17 Slide 6 EECS 470
Large Blocks and Subblocking
Can get the best of both worlds.
Large cache blocks can take a long time to refill:
- refill the cache line critical-word-first
- restart the cache access before the refill completes
Large cache blocks can waste bus bandwidth if block size is larger than spatial locality:
- divide a block into subblocks
- associate a separate valid bit with each subblock
- only load a subblock on access, while still keeping the reduced tag overhead

[Block layout: tag | v subblock | v subblock | v subblock]
Lecture 17 Slide 7 EECS 470
Multi-Level Caches
Processors are getting faster w.r.t. main memory:
- larger caches reduce the frequency of the more costly misses
- but larger caches are too slow for the processor
=> gradually reduce the miss cost with multiple cache levels

tavg = thit + miss ratio x tmiss
Lecture 17 Slide 8 EECS 470
Multi-Level Cache Design
[Figure: Proc connects to split L1I/L1D caches, backed by a unified L2]

Each level uses a different technology and has different requirements, so each makes a different choice of capacity, block size, and associativity.

tavg-L1 = thit-L1 + miss-ratioL1 x tavg-L2
tavg-L2 = thit-L2 + miss-ratioL2 x tmemory

What is the miss ratio?
- global: L2 misses / L1 accesses
- local: L2 misses / L1 misses
Lecture 17 Slide 9 EECS 470
The Inclusion Property
Inclusion means L2 is a superset of L1 (ditto for L3...).
Why?
- an addr in L1 is frequently used, so it would be kept in L2 anyway
- makes L1 writebacks simpler
- L2 can handle external coherence checks without involving L1
Inclusion takes effort to maintain:
- L2 must track what is cached in L1
- on an L2 replacement, the corresponding blocks must be flushed from L1
How can a violation happen? Consider:
1. L1 block size < L2 block size
2. different associativity in L1
3. L1 filters the L2 access sequence, which affects L2's LRU replacement order
Lecture 17 Slide 10 EECS 470
Possible Inclusion Violation
Setup: 2-way set-associative L1 holding a and b; direct-mapped L2.
- a, b, c have the same L1 index bits
- b, c have the same L2 index bits
- a and {b, c} have different L2 index bits
Step 1: L1 miss on c
Step 2: a is displaced from L1 to L2
Step 3: in L2, b is replaced by c
Now b is still in L1 but no longer in L2, violating inclusion.
Lecture 17 Slide 11 EECS 470
Non-blocking Caches
Also known as lockup-free caches.
Instead of stalling pending accesses to the cache on a miss, keep track of misses in special registers (Miss Status Holding Registers, MSHRs) and keep handling new requests.
Key implementation problems:
- handle reads to a pending miss
- handle writes to a pending miss
- keep multiple requests straight

Example access stream: ld A (hit), ld B (miss), ld C (miss), ld D (hit), st B (miss, pending). The MSHRs track the outstanding misses to B and C.
Lecture 17 Slide 12 EECS 470
EECS 470 Roadmap
[Roadmap: speed up programs by parallelizing, reducing instruction latency, and reducing the number of instructions; instruction-level parallelism covers instruction flow, while caching and prefetching reduce average memory latency (memory flow). Today: prefetching.]
Lecture 17 Slide 13 EECS 470
The memory wall
Today: 1 mem access ≈ 500 arithmetic ops
How do we reduce memory stalls for existing SW?

[Figure: log-scale performance vs. year, 1985-2010: processor performance pulls steadily away from memory performance. Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 4th ed.]
Lecture 17 Slide 14 EECS 470
Conventional approach #1: Avoid main memory accesses
Cache hierarchies:
- trade off capacity for speed
Add more cache levels?
- diminishing locality returns
- no help for shared data in MPs

[Figure: CPU -> 64K L1 (2 clk) -> 4M L2 (20 clk) -> main memory (200 clk), with write data flowing back from the CPU]
Lecture 17 Slide 15 EECS 470
Conventional approach #2: Hide memory latency
Out-of-order execution:
- overlap compute & mem stalls
Expand the OoO instruction window?
- issue & load-store logic are hard to scale
- no help for dependent instructions

[Figure: execution timelines: in-order serializes compute and mem stalls, while OoO overlaps them]
Lecture 17 Slide 16 EECS 470
What is Prefetching?
• Fetch memory before it's needed
• Targets compulsory, capacity, & coherence misses
Big challenges:
1. Knowing “what” to fetch
• Fetching useless info wastes valuable resources
2. Knowing “when” to fetch it
• Fetching too early clutters storage
• Fetching too late defeats the purpose of “pre”-fetching
Lecture 17 Slide 17 EECS 470
Software Prefetching
The compiler or programmer places prefetch instructions; requires ISA support.
Why not use regular loads?
Found in recent ISAs such as SPARC V9.
Prefetch into:
- a register (binding)
- caches (non-binding)
Lecture 17 Slide 18 EECS 470
Software Prefetching (Cont.)
e.g.,
for (I = 1; I < rows; I++)
  for (J = 1; J < columns; J++)
  {
    prefetch(&x[I+1][J]);   /* non-binding prefetch of the next row */
    sum = sum + x[I][J];
  }
Lecture 17 Slide 19 EECS 470
Hardware Prefetching
What to prefetch?
- one block spatially ahead?
- use address predictors: works for regular patterns (x, x+8, x+16, ...)
When to prefetch?
- on every reference
- on every miss
- when previously prefetched data is referenced
Where to put prefetched data?
- auxiliary buffers
- caches
Poll: Which cache is probably easier to design a prefetcher for?
Poll: We've already seen one implicit form of prefetching. When?
Lecture 17 Slide 20 EECS 470
Spatial Locality and Sequential Prefetching
Sequential prefetching: just grab the next few lines from memory.
Works well for the I-cache:
- instruction fetch tends to access memory sequentially
Doesn't work as well for the D-cache:
- more irregular access patterns
- regular patterns may have non-unit stride (e.g., matrix code)
Relatively easy to implement:
- a large cache block size already has the effect of prefetching
- after loading one cache line, start loading the next line automatically if it is not in the cache and the bus is not busy
- if we know the typical basic block size (i.e., the avg distance between branches), we can fetch the next several lines
Lecture 17 Slide 21 EECS 470
Stride Prefetchers
The access pattern of a particular static load is more predictable.
Reference Prediction Table (RPT):
- remembers previously executed loads: their PC, the last address referenced, and the stride between the last two references
- when executing a load, look it up in the RPT and compute the distance between the current data addr and the last addr
- if the new distance matches the old stride, we have found a pattern: go ahead and prefetch "current addr + stride"
- update "last addr" and "last stride" for the next lookup

RPT entry (indexed by load inst PC):
Load Inst PC (tag) | Last Address Referenced | Last Stride | Flags
Lecture 17 Slide 22 EECS 470
Stream Buffers [Jouppi]
Each stream buffer holds one stream of sequentially prefetched cache lines.
On a load miss, check the head of all stream buffers for an address match:
- if hit, pop the entry from the FIFO and update the cache with the data
- if not, allocate a new stream buffer to the new miss address (may have to recycle a stream buffer following an LRU policy)
Stream buffer FIFOs are continuously topped off with subsequent cache lines whenever there is room and the bus is not busy.
Stream buffers can incorporate stride prediction mechanisms to support non-unit-stride streams.
No cache pollution: prefetched lines wait in the buffers rather than evicting cache blocks.

[Figure: several stream-buffer FIFOs sitting between the DCache and the memory interface]
Lecture 17 Slide 23 EECS 470
Generalized Access Pattern Prefetchers
How do you prefetch
1. Heap data structures?
2. Indirect array accesses?
3. Generalized memory access patterns?
Current proposals:
• Precomputation prefetchers (runahead execution)
• Address correlating prefetchers (temporal memory streaming)
• Spatial pattern prefetchers (spatial memory streaming)
Lecture 17 Slide 24 EECS 470
Runahead Prefetchers
First proposed for I/O prefetching (Gibson et al.)
Duplicate the program:
• Only execute the address-generating stream
• Let it run ahead
May run as a thread on:
• A separate processor
• The same multithreaded processor
Or on custom address generation logic
Many names: slipstream, precomputation, runahead, ...

[Figure: main thread alongside a prefetch thread running ahead of it]
Lecture 17 Slide 25 EECS 470
Runahead Prefetcher
To get ahead:
• Must avoid waiting
• Must compute less
Predict
1. Control flow thru branch prediction
2. Data flow thru value prediction
3. Address generation computation only
+ Prefetch any pattern (need not be repetitive)
- Prediction only as good as branch + value prediction
How much prefetch lookahead?
Lecture 17 Slide 26 EECS 470
Correlation-Based Prefetching
Consider the following history of load addresses emitted by a processor:
A, B, C, D, C, E, A, C, F, F, E, A, A, B, C, D, E, A, B, C, D, C
After referencing a particular address (say A or E), are some addresses more likely to be referenced next?

[Figure: Markov model over addresses A-F, with transition probabilities learned from the history, e.g. A -> B with probability .6 and A -> C with probability .2]
Lecture 17 Slide 27 EECS 470
Correlation-Based Prefetching (Cont.)
Track the likely next addresses after seeing a particular addr.
Prefetch accuracy is generally low, so prefetch up to N next addresses to increase coverage (but this wastes bandwidth).
Prefetch accuracy can be improved by using a longer history:
- decide which address to prefetch next by looking at the last K load addresses instead of just the current one
- e.g., index with the XOR of the data addresses from the last K loads
- using a history of even a couple of loads can increase accuracy dramatically
This technique can also be applied to just the load miss stream.

Table entry (indexed by load data addr):
Load Data Addr (tag) | Prefetch Candidate 1 | Confidence | ... | Prefetch Candidate N | Confidence
Lecture 17 Slide 28 EECS 470
More info on Prefetching?
Professor Wenisch (a professor here, currently working at Google) wrote a great summary of the state of the art, available through umich IP addresses:
https://www.morganclaypool.com/doi/abs/10.2200/S00581ED1V01Y201405CAC028
Lecture 17 Slide 29 EECS 470
Improving Cache Performance: Summary
Miss rate:
- large block size
- higher associativity
- victim caches
- skewed-/pseudo-associativity
- hardware/software prefetching
- compiler optimizations
Miss penalty:
- give priority to read misses over writes/writebacks
- subblock placement
- early restart and critical word first
- non-blocking caches
- multi-level caches
Hit time (difficult?):
- small and simple caches
- avoiding translation during L1 indexing (later)
- pipelining writes for fast write hits
- subblock placement for fast write hits in write-through caches
Lecture 17 Slide 30 EECS 470
Next Time
Multicore!
Lingering questions / feedback? I'll include an anonymous form at the end of every lecture: https://bit.ly/3oSr5FD