8/13/2019 ACA Unit-5
1/54
UNIT-V
Memory Hierarchy Design
Memory Hierarchy Design
5.1 Introduction
5.2 Review of the ABCs of Caches
5.3 Cache Performance
5.4 Reducing Cache Miss Penalty
5.5 Reducing Cache Miss Rate
5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism
5.7 Reducing Hit Time
5.8 Main Memory and Organizations for Improving Performance
5.9 Memory Technology
5.10 Virtual Memory
5.11 Protection and Examples of Virtual Memory
The five classic components of a computer:
Control
Datapath
Memory
Input
Output
(the processor consists of the control and the datapath)
Where do we fetch instructions to execute?
Build a memory hierarchy that includes main memory and caches (internal memory) and hard disk (external memory)
Instructions are first fetched from external storage such as hard disk and are kept in main memory. Before they go to the CPU, they are typically brought into the caches
Technology Trends

DRAM capacity and cycle time:
Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns
2000   256 Mb   100 ns

        Capacity         Speed (latency)
CPU:    2x in 1.5 years  2x in 1.5 years
DRAM:   4x in 3 years    2x in 10 years
Disk:   4x in 3 years    2x in 10 years

Over this period DRAM capacity grew 4000:1, but speed only 2.5:1!
Performance Gap between CPUs and Memory

CPU performance (improvement ratio) grew 1.35X/yr, then 1.55X/yr, while memory improved only about 7%/yr.
The gap (latency) grows about 50% per year!
Levels of the Memory Hierarchy

Level           Capacity    Access Time
CPU registers   500 bytes   0.25 ns
Cache           64 KB       1 ns
Main memory     512 MB      100 ns
Disk            100 GB      5 ms

Upper levels are faster; lower levels are larger. Data moves between cache and main memory in blocks, between main memory and I/O devices (disk) in pages, and between disk and lower storage in files.
ABCs of Caches

Cache:
In this textbook it mainly means the first level of the memory hierarchy encountered once the address leaves the CPU
The term is applied whenever buffering is employed to reuse commonly occurring items, e.g. file caches, name caches, and so on
Principle of Locality:
Programs access a relatively small portion of the address space at any instant of time.
Two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
Memory Hierarchy: Terminology

Hit: the data appears in some block in the cache (example: Block X)
Hit Rate: the fraction of cache accesses found in the cache
Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
Miss: the data must be retrieved from a block in the main memory (Block Y)
Miss Rate = 1 - (Hit Rate)
Miss Penalty: time to replace a block in the cache + time to deliver the block to the processor
Example (p. 395): Assume we have a computer where the CPI is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?

Answer:
(A) If instructions always hit in the cache, CPI = 1.0 and there are no memory stalls:
CPU time(A) = (IC x CPI + 0) x clock cycle time = IC x clock cycle time
(B) With a 2% miss rate and CPI = 1.0, we must add the memory stall cycles:
Memory stall cycles = IC x (memory accesses/instruction) x miss rate x miss penalty
= IC x (1 + 50%) x 2% x 25 = IC x 0.75
CPU time(B) = (IC + IC x 0.75) x clock cycle time = 1.75 x IC x clock cycle time
The performance ratio is the inverse of the CPU execution times:
CPU time(B)/CPU time(A) = 1.75
The computer with no cache misses is 1.75 times faster.
Four Memory Hierarchy Questions
Q1 (block placement):Where can a block be placed in the upper level?
Q2 (block identification):
How is a block found if it is in the upper level?
Q3 (block replacement): Which block should be replaced on a miss?
Q4 (write strategy):
What happens on a write?
Q1 (block placement): Where can a block be placed?

Direct mapped: (Block number) mod (Number of blocks in cache)
Set associative: (Block number) mod (Number of sets in cache)
# of sets <= # of blocks; n-way: n blocks in a set; 1-way = direct mapped
Fully associative: # of sets = 1
Example: block 12 placed in an 8-block cache
Simplest Cache: Direct Mapped (1-way)

[Figure: a 16-block memory (block numbers 0 through F) mapped onto a 4-block direct-mapped cache (indices 0 through 3)]
A block has only one place it can appear in the cache. The mapping is usually
(Block address) MOD (Number of blocks in cache)
Example: 1 KB Direct Mapped Cache, 32-byte Blocks

For a 2^N-byte cache with 2^M-byte blocks:
The uppermost (32 - N) bits are the Cache Tag
The lowest M bits are the Byte Select (block size = 2^M)
Here N = 10 and M = 5, so a 32-bit address splits into a 22-bit Cache Tag (bits 31-10), a 5-bit Cache Index (bits 9-5), and a 5-bit Byte Select (bits 4-0). A valid bit and the Cache Tag (e.g. 0x50) are stored as part of the cache state alongside each 32-byte data block (bytes 0-31, 32-63, ..., 992-1023).
Example address fields: Cache Tag 0x50, Cache Index 0x01, Byte Select 0x00.
Q2 (block identification): How is a block found?

The block offset selects the desired data from the block, the index field selects the set, and the tag field is compared against the CPU address for a hit:
Use the Cache Index to select the cache set
Check the Tag on each block in that set
No need to check the index or block offset
A valid bit is added to the tag to indicate whether or not this entry contains a valid address
Select the desired bytes using the Block Offset
Increasing associativity shrinks the index and expands the tag

Three portions of an address in a set-associative or direct-mapped cache:
| Tag | Cache/Set Index | Block Offset (block size) |
Example: Two-way Set Associative Cache

The Cache Index selects a set from the cache
The two tags in the set are compared in parallel (each qualified by its valid bit)
Data is selected based on the tag comparison result: the two compare outputs are ORed to form Hit, and the select signals drive a mux that picks the matching cache block
Example address fields as before: Cache Tag 0x50, Cache Index 0x01, Byte Select 0x00
Disadvantage of Set Associative Cache

N-way set associative cache vs. direct mapped cache:
N comparators vs. 1
Extra MUX delay for the data
Data comes AFTER Hit/Miss
In a direct mapped cache, the cache block is available BEFORE Hit/Miss:
Possible to assume a hit and continue; recover later if it was a miss
Q3 (block replacement): Which block should be replaced on a cache miss?

Easy for direct mapped: hardware decisions are simplified; only one block frame is checked and only that block can be replaced
Set associative or fully associative: there are many blocks to choose from on a miss
Three primary strategies for selecting the block to be replaced:
Random: randomly selected
LRU: the Least Recently Used block is removed
FIFO (First In, First Out)

Data cache misses per 1000 instructions for various replacement strategies:

Associativity:  2-way                 4-way                 8-way
Size     LRU    Random  FIFO   LRU    Random  FIFO   LRU    Random  FIFO
16 KB    114.1  117.3   115.5  111.7  115.1   113.3  109.0  111.8   110.4
64 KB    103.4  104.3   103.9  102.4  102.3   103.1  99.7   100.5   100.3
256 KB   92.2   92.1    92.5   92.1   92.1    92.5   92.1   92.1    92.5

There is little difference between LRU and random for the largest cache size, with LRU outperforming the others for the smaller caches. FIFO generally outperforms random in the smaller cache sizes.
Q4 (write strategy): What happens on a write?

Reads dominate processor cache accesses: e.g. 7% of overall memory traffic is writes, while 21% of data cache accesses are writes
Two options we can adopt when writing to the cache:
Write through: the information is written to both the block in the cache and the block in the lower-level memory.
Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
To reduce the frequency of writing back blocks on replacement, a dirty bit is used to indicate whether the block was modified in the cache (dirty) or not (clean). If clean, no write back is needed, since identical information is found below the cache.
Pros and cons:
WT: simpler to implement. The cache is always clean, so read misses cannot result in writes.
WB: writes occur at the speed of the cache, and multiple writes within a block require only one write to the lower-level memory.
Write-Miss Policy: Write Allocate vs. No-write Allocate

Two options on a write miss:
Write allocate: the block is allocated on a write miss, followed by the write-hit actions
Write misses act like read misses
No-write allocate: write misses do not affect the cache; the block is modified only in the lower-level memory
Blocks stay out of the cache under no-write allocate until the program tries to read them, but with write allocate even blocks that are only written will still be in the cache
Write-Miss Policy Example

Example: Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations:
Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100].
What are the numbers of hits and misses (reads and writes inclusive) when using no-write allocate versus write allocate?

Answer:
No-write allocate:             Write allocate:
Write Mem[100]; 1 write miss   Write Mem[100]; 1 write miss
Write Mem[100]; 1 write miss   Write Mem[100]; 1 write hit
Read  Mem[200]; 1 read miss    Read  Mem[200]; 1 read miss
Write Mem[200]; 1 write hit    Write Mem[200]; 1 write hit
Write Mem[100]; 1 write miss   Write Mem[100]; 1 write hit
4 misses; 1 hit                2 misses; 3 hits
Cache Performance
Example: Split Cache vs. Unified Cache

Which has the better average memory access time?
A 16-KB instruction cache with a 16-KB data cache (split cache), or
A 32-KB unified cache?

Miss rates:
Size    Instruction Cache   Data Cache   Unified Cache
16 KB   0.4%                11.4%
32 KB                                    3.18%

Assume:
A hit takes 1 clock cycle and the miss penalty is 100 cycles
A load or store takes 1 extra clock cycle on a unified cache, since there is only one cache port
36% of the instructions are data transfer instructions, so about 74% of the memory accesses are instruction references

Answer:
Average memory access time (split)
= % instructions x (hit time + instruction miss rate x miss penalty) + % data x (hit time + data miss rate x miss penalty)
= 74% x (1 + 0.4% x 100) + 26% x (1 + 11.4% x 100) = 4.24
Average memory access time (unified)
= 74% x (1 + 3.18% x 100) + 26% x (1 + 1 + 3.18% x 100) = 4.44
Impact of Memory Access on CPU Performance

Example: Suppose a processor with:
Ideal CPI = 1.0 (ignoring memory stalls)
An average miss rate of 2%
An average of 1.5 memory references per instruction
A miss penalty of 100 cycles
What is the impact on performance when the behavior of the cache is included?

Answer:
CPI = CPU execution cycles per instr. + memory stall cycles per instr.
= CPI execution + miss rate x memory accesses per instr. x miss penalty
CPI with cache = 1.0 + 2% x 1.5 x 100 = 4
CPI without cache = 1.0 + 1.5 x 100 = 151
CPU time with cache = IC x CPI x clock cycle time = IC x 4.0 x clock cycle time
CPU time without cache = IC x 151 x clock cycle time
Without any cache, the CPI of the processor increases from 1 to 151!
Even with the cache, 75% of the time the processor is stalled waiting for memory (CPI goes from 1 to 4).
Impact of Cache Organizations on CPU Performance

Example: What is the impact of two different cache organizations (direct mapped vs. 2-way set associative) on the performance of a CPU?
Ideal CPI = 2.0 (ignoring memory stalls)
Clock cycle time is 1.0 ns
An average of 1.5 memory references per instruction
Cache size: 64 KB; block size: 64 bytes
For the set-associative cache, assume the clock cycle time is stretched 1.25 times to accommodate the selection multiplexer
Cache miss penalty is 75 ns
Hit time is 1 clock cycle
Miss rate: direct mapped 1.4%; 2-way set associative 1.0%

Answer:
Avg. memory access time (1-way) = 1.0 + (0.014 x 75) = 2.05 ns
Avg. memory access time (2-way) = 1.0 x 1.25 + (0.01 x 75) = 2.00 ns
CPU time = IC x (CPI execution x clock cycle time + miss rate x memory accesses per instruction x miss penalty)
CPU time (1-way) = IC x (2.0 x 1.0 + (1.5 x 0.014 x 75)) = 3.58 x IC
CPU time (2-way) = IC x (2.0 x 1.0 x 1.25 + (1.5 x 0.01 x 75)) = 3.63 x IC
Summary of Performance Equations
Improving Cache Performance

The next few sections in the textbook look at ways to improve cache and memory access times.

CPU time = IC x (CPI execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time

Average memory access time = Hit time + Miss rate x Miss penalty
(Hit time: Section 5.7; Miss rate: Section 5.5; Miss penalty: Section 5.4)
Reducing Cache Miss Penalty

Average memory access time = Hit time + Miss rate x Miss penalty

The time to handle a miss is becoming more and more the controlling factor. This is because of the great improvement in speed of processors as compared to the speed of memory.
Five optimizations:
1. Multilevel caches
2. Critical word first and early restart
3. Giving priority to read misses over writes
4. Merging write buffer
5. Victim caches
O1: Multilevel Caches
Approaches:
Make the cache faster, to keep pace with the speed of CPUs
Make the cache larger, to overcome the widening gap
L1: fast hits; L2: fewer misses

L2 equations:
Average memory access time = Hit time(L1) + Miss rate(L1) x Miss penalty(L1)
Miss penalty(L1) = Hit time(L2) + Miss rate(L2) x Miss penalty(L2)
Average memory access time = Hit time(L1) + Miss rate(L1) x (Hit time(L2) + Miss rate(L2) x Miss penalty(L2))
Design of L2 Cache

Size:
Since everything in the L1 cache is likely to be in the L2 cache, the L2 cache should be much bigger than L1
Whether data in L1 is also in L2:
Novice approach: design L1 and L2 independently
Multilevel inclusion: L1 data are always present in L2
Advantage: easy consistency between I/O and cache (check L2 only)
Drawback: L2 must invalidate all L1 blocks that map onto a 2nd-level block to be replaced => slightly higher 1st-level miss rate
e.g. Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2
Multilevel exclusion: L1 data is never found in L2
A cache miss in L1 results in a swap of blocks between L1 and L2
Advantage: prevents wasting space in L2
e.g. AMD Athlon: 64 KB L1 and 256 KB L2
O2: Critical Word First and Early Restart

Don't wait for the full block to be loaded before restarting the CPU
Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
Given spatial locality, the CPU tends to want the next sequential word, so it is not clear how much early restart benefits
Generally useful only with large blocks
O3: Giving Priority to Read Misses over Writes

Serve reads before writes have been completed
Write through with write buffers: a read miss may need a value that is still waiting in the write buffer, e.g. after
SW R3, 512(R0)   ; M[512] <- R3
a later read of M[512] must either wait for the write buffer to drain or check the buffer's contents before reading memory
O4: Merging Write Buffer

If the write buffer is empty, the data and the full address are written into the buffer, and the write is finished from the CPU's perspective
Usually a write buffer entry holds multiple words
Write merging: the addresses in the write buffer are checked to see whether the address of the new data matches the address of a valid write buffer entry. If so, the new data are combined with that entry
Example: a write buffer with 4 entries, each holding four 64-bit words; without merging, four sequential word writes occupy four entries; with merging, the four writes share a single entry
Writing multiple words at the same time is faster than writing multiple times
O5: Victim Caches

Idea of recycling: remember what was most recently discarded on a cache miss in case it is needed again, rather than simply discarding it or swapping it into L2
Victim cache: a small, fully associative cache between a cache and its refill path
Contains only blocks that were discarded from the cache because of a miss, the "victims"
Checked on a miss before going to the next lower-level memory
Victim caches of 1 to 5 entries are effective at reducing misses, especially for small, direct-mapped data caches
AMD Athlon: 8 entries
Reducing Miss Rate

3 Cs of Cache Misses
Compulsory: the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache)
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X)
Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache that would hit in a fully associative cache of size X)
3 Cs of Cache Misses

2:1 Cache Rule:
miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2

[Figure: 3Cs absolute miss rate (SPEC92) — miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity; conflict misses shrink as associativity grows, capacity misses dominate, and compulsory misses are vanishingly small]
3Cs Relative Miss Rate

[Figure: the same data normalized to 100% — each miss type's share vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity]
Flaw: assumes a fixed block size
Good: insight => invention
Five Techniques to Reduce Miss Rate

1. Larger block size
2. Larger caches
3. Higher associativity
4. Way prediction and pseudoassociative caches
5. Compiler optimizations
O1: Larger Block Size

[Figure: miss rate (0-25%) vs. block size (16-256 bytes) for cache sizes of 1K to 256K]
Take advantage of spatial locality: the larger the block, the greater the chance that parts of it will be used again
The number of blocks is reduced for a cache of the same size => increased miss penalty
It may increase conflict misses, and even capacity misses if the cache is small
Usually, high latency and high bandwidth encourage large block sizes
O2: Larger Caches

Increasing the capacity of the cache reduces capacity misses (Figures 5.14 and 5.15)
May mean a longer hit time and higher cost
Trend: larger L2 or L3 off-chip caches
O3: Higher Associativity

Figures 5.14 and 5.15 show how miss rates improve with higher associativity
8-way set associative is as effective as fully associative for practical purposes
2:1 Cache Rule: miss rate of a direct-mapped cache of size N = miss rate of a 2-way cache of size N/2
Tradeoff: a higher-associativity cache complicates the circuit and may lengthen the clock cycle
Beware: execution time is the only final measure!
Will clock cycle time increase as a result of having a more complicated cache?
Hill [1988] suggested the hit time for 2-way vs. 1-way is: external cache +10%, internal +2%
O4: Way Prediction and Pseudoassociative Caches

Way prediction: extra bits are kept in the cache to predict the way, or block within the set, of the next cache access
Example: the 2-way I-cache of the Alpha 21264
If the predictor is correct, I-cache latency is 1 clock cycle
If incorrect, it tries the other block, changes the way predictor, and has a latency of 3 clock cycles
In excess of 85% prediction accuracy
Reduces conflict misses while maintaining the hit speed of a direct-mapped cache
Pseudoassociative or column associative:
On a miss, a second cache entry is checked before going to the next lower level: one fast hit and one slow hit
Invert the most significant bit of the index to find the other block in the "pseudoset"
Miss penalty may become slightly longer
O5: Compiler Optimizations

Improve hit rate by compile-time optimization
Reordering instructions using profiling information (McFarling [1989])
Reduced misses by 50% for a 2 KB direct-mapped I-cache with 4-byte blocks, and by 75% in an 8 KB cache
Best performance when it was possible to prevent some instructions from entering the cache
Aligning basic blocks: the entry point is at the beginning of a cache block
Decreases the chance of a cache miss for sequential code
Loop interchange: exchanging the nesting of loops
Improves spatial locality => reduces misses
Makes data be accessed in order => maximizes use of the data in a cache block before it is discarded
(the loops below complete the fragment truncated in the original, following the standard example with x declared as x[5000][100])

/* Before: stride-100 access through memory */
for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
        x[i][j] = 2 * x[i][j];

/* After: sequential, stride-1 access */
for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
        x[i][j] = 2 * x[i][j];
Blocking: operating on submatrices or blocks
Maximize accesses to the data loaded into the cache before it is replaced
Improves temporal locality
Example: matrix multiply X = Y * Z (the code below completes the fragment truncated in the original, following the standard blocked version with blocking factor B)

/* Before */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

/* After: blocked with factor B */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B, N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }
5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism

Three techniques that overlap memory access with the execution of instructions:
1. Nonblocking caches, to reduce stalls on cache misses (to match out-of-order processors)
2. Hardware prefetching of instructions and data
3. Compiler-controlled prefetching
O2: Hardware Prefetching of Instructions and Data

Prefetch instructions or data before they are requested by the CPU
Either directly into the caches or into an external buffer (faster to access than main memory)
Instruction prefetch is frequently done in hardware outside the cache: fetch two blocks on a miss
The requested block is placed in the I-cache when it returns
The prefetched block is placed in an instruction stream buffer (ISB)
A single ISB would catch 15% to 25% of the misses from a 4 KB direct-mapped I-cache with 16-byte blocks; 4 ISBs increased the data hit rate to 43% (Jouppi 1990)
UltraSPARC III: data prefetch
If a load hits in the prefetch cache, the block is read from the prefetch cache and the next prefetch request is issued, calculating the stride of the next prefetched block from the difference between the current address and the previous address
Up to 8 simultaneous prefetches
Prefetching may interfere with demand misses, lowering performance
O3: Compiler-Controlled Prefetching

Register prefetch: load the value into a register
Cache prefetch: load the data only into the cache (not a register)
Faulting vs. nonfaulting: the address does or does not cause an exception for virtual address faults and protection violations
A normal load instruction = a faulting register prefetch instruction
The most effective prefetch is semantically invisible to the program:
It doesn't change the contents of registers or memory, and it cannot cause virtual memory faults
Nonbinding prefetch: a nonfaulting cache prefetch
Overlapping execution: the CPU proceeds while the prefetched data are being fetched
Advantage: the compiler may avoid the unnecessary prefetches that hardware would issue
Drawback: prefetch instructions incur instruction overhead
5.7 Reducing Hit Time

Importance of cache hit time:
Average memory access time = Hit time + Miss rate x Miss penalty
More importantly, cache access time limits the clock cycle rate in many processors today!
Fast hit time: quickly and efficiently find out whether the data is in the cache, and if it is, get it out of the cache
Four techniques:
1. Small and simple caches
2. Avoiding address translation during indexing of the cache
3. Pipelined cache access
4. Trace caches
O1: Small and Simple Caches

A time-consuming portion of a cache hit is using the index portion of the address to read the tag memory and then compare it to the address
Guideline: smaller hardware is faster
Why does the Alpha 21164 have an 8 KB instruction cache and an 8 KB data cache plus a 96 KB second-level cache? A small data cache allows a fast clock rate
Guideline: simpler hardware is faster
Direct mapped, on chip
General design: a small and simple cache for the 1st-level cache
Keep the tags on chip and the data off chip for 2nd-level caches
The recent emphasis is on fast clock time while hiding L1 misses with dynamic execution, and using L2 caches to avoid going to memory
O2: Avoiding Address Translation during Cache Indexing

Three organizations:
Conventional organization: the CPU's virtual address (VA) is translated by the TLB to a physical address (PA) before the cache is accessed; the cache and memory see only PAs
Virtually addressed cache: the cache is accessed with the VA and translation happens only on a miss; this avoids translation on hits but raises the synonym problem (two VAs mapping to the same PA)
Virtually indexed, physically tagged cache: the cache access is overlapped with VA translation, the tags are physical, and an L2 cache holds PA tags behind it; this requires the cache index to remain invariant across translation
O3: Pipelined Cache Access

Simply pipeline the cache access, so a 1st-level cache hit takes multiple clock cycles
Advantage: a fast cycle time (at the price of slow hits)
Example: clock cycles to access instructions in the I-cache
Pentium: 1 clock cycle
Pentium Pro through Pentium III: 2 clocks
Pentium 4: 4 clocks
Drawbacks: more pipeline stages lead to a greater penalty on mispredicted branches and more clock cycles between the issue of a load and the use of the data
Note that pipelining increases the bandwidth of instruction accesses rather than decreasing the actual latency of a cache hit
O4: Trace Caches

A trace cache for instructions finds a dynamic sequence of instructions, including taken branches, to load into a cache block
The cache blocks contain dynamic traces of the executed instructions as determined by the CPU, rather than static sequences of instructions as determined by memory layout
Branch prediction is folded into the cache: the predictions are validated along with the addresses to confirm a valid fetch
e.g. the Intel NetBurst microarchitecture
Advantage: better utilization
Trace caches store instructions only from the branch entry point to the exit of the trace
The unused part of a long block that is entered or exited by a taken branch in a conventional I-cache may never be fetched
Downside: the same instructions may be stored multiple times