Cache Extras & Branch Prediction
TLB and Cache Operation (See Figure 7.26 on page 594)
• On a memory access, the following operations occur.
Virtual Memory And Cache
[Figure: memory-access flow: the Virtual Address goes to a TLB Access; on a TLB miss, Access the Page Table; on a TLB hit, the resulting Physical Address is used to Access the Cache; on a cache hit, the Data is returned; on a cache miss, Access Memory.]
TLB access is serial with cache access
Overlapped TLB & Cache Access
Virtual Memory view of a Physical Address (bits 29..0): Physical page number | Page offset (bits 11..0)
Cache view of a Physical Address: Tag | #Set | Disp
In this example the #Set field is not contained within the Page Offset, so the #Set is not known until the Physical page number is known; the cache can be accessed only after address translation is done.
Overlapped TLB & Cache Access (cont)
Virtual Memory view of a Physical Address (bits 29..0): Physical page number | Page offset (bits 11..0)
Cache view of a Physical Address: Tag | #Set | Disp
In this example the #Set field is contained within the Page Offset, so the #Set is known immediately and the cache can be accessed in parallel with address translation: first the tags from the appropriate set are brought out; the tag comparison takes place only after the Physical page number is known (after address translation is done).
Overlapped TLB & Cache Access (cont)
• First Stage
– The Virtual Page Number goes to the TLB for translation
– The Page offset goes to the cache to get all tags from the appropriate set
• Second Stage
– The Physical Page Number, obtained from the TLB, goes to the cache for tag comparison.
• Limitation: Cache size ≤ (page size × associativity) (see the worked numbers after this list)
• How can we overcome this limitation?
– Fetch 2 sets and mux after translation
– Increase associativity
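A worked instance of the limitation (illustrative numbers, not taken from the slides): with 4 KB pages the page offset is 12 bits; with 32-byte lines, 5 of those bits are the displacement, leaving at most 7 set-index bits inside the offset, i.e. 128 sets. A 2-way cache can then be at most 128 sets × 2 ways × 32 B = 8 KB = page size × associativity.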
Pseudo LRU
We will use a 4-way set-associative cache as an example.
Full LRU records the full order of way accesses in each set (which way was most recently accessed, which was second, and so on).
Pseudo LRU (PLRU) records only a partial order, using 3 bits per set:
– Bit0 specifies whether the LRU way is one of ways 0 and 1 or one of ways 2 and 3
– Bit1 specifies which of ways 0 and 1 was least recently used
– Bit2 specifies which of ways 2 and 3 was least recently used
For example, if the order in which the ways were accessed is 3, 0, 2, 1, then bit0=1, bit1=1, bit2=1. The bits form a small binary tree: bit0 at the root selects between the {0,1} and {2,3} pairs, and bit1/bit2 select within each pair.
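A minimal C sketch of tree-PLRU for one 4-way set follows; the bit polarity chosen here is one possible encoding and may not match the slide's figure exactly.

/* A minimal sketch of tree-PLRU for one 4-way set.
 * Convention assumed here (one possible encoding):
 *   bit0 = 1 -> the LRU way is in pair {2,3}, bit0 = 0 -> in pair {0,1}
 *   bit1 = 1 -> way 1 is the LRU of {0,1}, bit1 = 0 -> way 0
 *   bit2 = 1 -> way 3 is the LRU of {2,3}, bit2 = 0 -> way 2 */
#include <stdint.h>

typedef struct { uint8_t bit0, bit1, bit2; } plru_t;

/* On an access to 'way' (0..3), make the tree bits point away from it. */
static void plru_touch(plru_t *p, int way)
{
    if (way < 2) {                /* {0,1} touched -> LRU pair is {2,3} */
        p->bit0 = 1;
        p->bit1 = (way == 0);     /* the other way of {0,1} becomes LRU */
    } else {                      /* {2,3} touched -> LRU pair is {0,1} */
        p->bit0 = 0;
        p->bit2 = (way == 2);
    }
}

/* Pick the pseudo-LRU victim by following the tree bits. */
static int plru_victim(const plru_t *p)
{
    if (p->bit0)                  /* LRU pair is {2,3} */
        return p->bit2 ? 3 : 2;
    return p->bit1 ? 1 : 0;       /* LRU pair is {0,1} */
}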
3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way and 8-way associativity, with the miss rate split into Conflict, Capacity and Compulsory components; the Compulsory component is vanishingly small.]
How Can We Reduce Misses?
3 Cs: Compulsory, Capacity, Conflict. In all cases, assume the total cache size is not changed. What happens if we:
1) Change the Block Size: which of the 3Cs is obviously affected?
2) Change the Associativity: which of the 3Cs is obviously affected?
3) Change the Compiler: which of the 3Cs is obviously affected?
Reduce Misses via Larger Block Size
[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes of 1K, 4K, 16K, 64K and 256K.]
Reduce Misses via Higher Associativity
We have two conflicting trends here: higher associativity
– improves the hit ratio
BUT
– increases the access time
– slows down replacement
– increases complexity
Most modern cache memory systems use at least 4-way set-associative caches.
Example: Avg. Memory Access Time vs. Miss Rate
Example: assume the cache access time (CAT) is 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, relative to the CAT of a direct-mapped cache.

Cache Size (KB)   1-way   2-way   4-way   8-way
  1               2.33    2.15    2.07    2.01
  2               1.98    1.86    1.76    1.68
  4               1.72    1.67    1.61    1.53
  8               1.46    1.48    1.47    1.43
 16               1.29    1.32    1.32    1.32
 32               1.20    1.24    1.25    1.27
 64               1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20

Effective access time to the cache (entries marked in red on the original slide are cases not improved by more associativity).
Note: this is for a specific example; a reminder of the underlying formula follows.
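As a reminder of how such numbers arise (a hedged sketch with assumed values, not the ones behind the table above): average memory access time = hit time + miss rate × miss penalty. With a 1-cycle hit time, a 10% miss rate and a 20-cycle miss penalty, AMAT = 1 + 0.10 × 20 = 3.0 cycles; doubling the associativity might cut the miss rate to 8% but stretch the hit time to 1.10 cycles, giving 1.10 + 0.08 × 20 = 2.70 cycles.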
Reducing Miss Penalty by Critical Word First
Don't wait for the full block to be loaded before restarting the CPU
– Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
– Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling in the rest of the words of the block. Also called wrapped fetch and requested word first.
Example:
– 64-bit = 8-byte bus, 32-byte cache line => 4 bus cycles to fill a line
– Fetching data from address 95H (line 80H-9FH): the chunk 90H-97H, which contains 95H, is transferred first, and the remaining chunks 98H-9FH, 80H-87H, 88H-8FH follow in wrapped order.
Prefetchers
In order to avoid compulsory misses, we need to bring the information into the cache before it is requested by the program
We can exploit the locality-of-reference behavior:
– Space -> bring in the neighborhood of the referenced data.
– Time -> the same “patterns” repeat themselves.
Prefetching relies on having extra memory bandwidth that can be used without penalty
There are hardware and software prefetchers.
Hardware Prefetching
Instruction Prefetching
– The Alpha 21064 fetches 2 blocks on a miss
– The extra block is placed in a stream buffer, to avoid possible cache pollution in case the prefetched instructions are not required
– On a miss, check the stream buffer
– Branch-predictor-directed prefetching: let the branch predictor run ahead
Data Prefetching
– Try to predict future data accesses: next sequential, stride, general pattern (a stride-prefetcher sketch follows)
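A hedged sketch (assumed structure, not any specific processor's design) of a PC-indexed stride prefetcher of the kind hinted at above; issue_prefetch is a hypothetical hook into the cache:

#include <stdint.h>

#define PREF_TABLE_SIZE 256

typedef struct {
    uint64_t last_addr;   /* last data address seen for this load PC */
    int64_t  stride;      /* last observed stride                    */
} stride_entry_t;

static stride_entry_t pref_table[PREF_TABLE_SIZE];

/* Hypothetical hook: would send a prefetch request to the cache. */
static void issue_prefetch(uint64_t addr) { (void)addr; }

/* Called on every load with its PC and effective address. */
void stride_prefetcher_access(uint64_t pc, uint64_t addr)
{
    stride_entry_t *e = &pref_table[(pc >> 2) % PREF_TABLE_SIZE];
    int64_t stride = (int64_t)(addr - e->last_addr);

    /* If the same non-zero stride repeats, predict the next access. */
    if (stride != 0 && stride == e->stride)
        issue_prefetch(addr + (uint64_t)stride);

    e->stride = stride;
    e->last_addr = addr;
}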
Software Prefetching
Data Prefetch
– Load data into a register (HP PA-RISC loads)
– Cache Prefetch: load into the cache (MIPS IV, PowerPC, SPARC v. 9)
– Special prefetching instructions cannot cause faults; a form of speculative execution
How it is done
– Special prefetch intrinsics in the language
– Automatically by the compiler
Issuing prefetch instructions takes time
– Is the cost of issuing prefetches < the savings in reduced misses?
– Wider superscalar issue reduces the difficulty of the extra issue bandwidth
A small intrinsic-based sketch follows.
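A small sketch of the idea using the GCC/Clang __builtin_prefetch intrinsic (the slide's HP PA-RISC / MIPS IV / PowerPC / SPARC instructions differ in syntax but play the same role); the prefetch distance of 16 iterations is an illustrative assumption:

#define PREFETCH_DISTANCE 16

void scale(double *a, int n)
{
    for (int i = 0; i < n; i++) {
        /* Non-faulting hint: start bringing a[i+16] toward the cache
         * well before it is needed (prefetch for write, low reuse). */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 1, 1);
        a[i] = 2.0 * a[i];
    }
}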
Reducing Misses by Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks, in software
• Instructions
– Reorder procedures in memory so as to reduce conflict misses
– Profiling to look at conflicts (using tools they developed)
• Data
– Merging Arrays: improve spatial locality by using a single array of compound elements vs. 2 arrays
– Loop Interchange: change the nesting of loops to access data in the order it is stored in memory
– Loop Fusion: combine 2 independent loops that have the same looping and some overlapping variables
– Blocking: improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rows
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];
Reducing conflicts between val & key; improve spatial locality
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory every 100 words; improved spatial locality
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }
2 misses per access to a & c vs. one miss per access; improve spatial locality
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }
• Two Inner Loops:
– Read all N×N elements of z[]
– Read N elements of 1 row of y[] repeatedly
– Write N elements of 1 row of x[]
• Capacity Misses are a function of N & Cache Size:
– 2N³ + N² => (assuming no conflict; otherwise …)
• Idea: compute on BxB submatrix that fits
Blocking Example
/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B,N); j = j+1)
      {
        r = 0;
        for (k = kk; k < min(kk+B,N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }
• B is called the Blocking Factor
• Capacity Misses go from 2N³ + N² to N³/B + 2N²
• Conflict Misses Too?
Other techniques
Multi-ported cache and Banked Cache
An n-ported cache enables n cache accesses in parallel
– Parallelize cache accesses in different pipeline stages
– Parallelize cache accesses in a superscalar processor
But multi-porting effectively doubles the cache die size. A possible solution: a banked cache
– Each line is divided into n banks
– Can fetch data from k ≤ n different banks (possibly in different lines)
Separate Code / Data Caches
Enables parallelism between data accesses (done in the memory-access stage) and instruction fetch (done in the fetch stage of a pipelined processor)
The code cache is a read-only cache
– No need to write the line back to memory when it is evicted
– Simpler to manage
What about self-modifying code? (x86 only)
– Whenever a memory write is executed, the code cache needs to be snooped
– If the code cache contains the written address, the line in which the address is contained is invalidated
– Now the code cache is accessed both in the fetch stage and in the memory-access stage; the tags need to be dual-ported to avoid stalling
Increasing the size with minimum latency loss - L2 cache
L2 is much larger than L1 (256K-1M compared to 32K-64K)
It used to be an off-chip cache (between the cache and the memory bus). Now, most implementations are on-chip (but some architectures have an off-chip level-3 cache).
– If L2 is on-chip, why not just make L1 larger?
L2 can be inclusive:
– All addresses in L1 are also contained in L2
– Data in L1 may be more up-to-date than in L2
– L2 is unified (code / data)
Most architectures do not require the caches to be inclusive (although, due to the size difference, they usually are)
Victim Cache
Problem: the per-set load may be non-uniform
– some sets may have more conflict misses than others
Solution: allocate ways to sets dynamically, according to the load
– When a line is evicted from the cache, it is placed in the victim cache
– If the victim cache is full, its LRU line is evicted to L2 to make room for the new victim line from L1
On a cache lookup, a victim-cache lookup is also performed (in parallel)
On a victim-cache hit:
– the line is moved back into the cache
– the evicted line is moved into the victim cache
– the access takes the same time as a cache hit
Especially effective for a direct-mapped cache
– Enables combining the fast hit time of a direct-mapped cache while still reducing conflict misses (a lookup sketch follows)
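A minimal sketch of the lookup path just described (data structures and sizes are illustrative assumptions; the tag here is simply the full line address):

#include <stdbool.h>
#include <stdint.h>

#define L1_SETS    1024
#define VC_ENTRIES 4

typedef struct { bool valid; uint64_t tag; /* 64-byte line data omitted */ } line_t;

static line_t l1[L1_SETS];          /* direct-mapped L1                     */
static line_t victim[VC_ENTRIES];   /* small fully associative victim cache */

bool cache_lookup(uint64_t addr)
{
    uint64_t line_addr = addr >> 6;             /* 64-byte lines            */
    line_t *slot = &l1[line_addr % L1_SETS];

    if (slot->valid && slot->tag == line_addr)
        return true;                            /* ordinary L1 hit          */

    for (int i = 0; i < VC_ENTRIES; i++) {      /* probe the victim cache   */
        if (victim[i].valid && victim[i].tag == line_addr) {
            line_t evicted = *slot;             /* swap: hit line back into */
            *slot = victim[i];                  /* L1, L1's old line becomes */
            victim[i] = evicted;                /* the new victim            */
            return true;
        }
    }
    return false;                               /* miss in both -> go to L2 */
}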
Stream Buffers
Before inserting a new line into the cache, put it in a stream buffer
A line is moved from the stream buffer into the cache only if we get some indication that the line will be accessed in the future
Example:
– Assume that we scan a very large array (much larger than the cache), and we access each item in the array just once
– If we insert the array into the cache, it will thrash the entire cache
– If we detect that this is just a scan-once operation (e.g., using a hint from the software), we can avoid putting the array lines into the cache
Virtual memory and process switch
• Whenever we switch the execution between processes we need to:
– Save the state of the process
– Load the state of the new process
– Make sure that the P0BPT and P1BPT registers are pointing to the new translation tables
– Clean the TLB
– We do not need to clean the cache (Why????)
Prediction: Branches, Dependencies, Data
New era in computing?
• Prediction has become essential to getting good performance from scalar instruction streams.
• We will discuss predicting branches, data dependencies, actual data, and results of groups of instructions:
– At what point does computation become a probabilistic operation + verification?
– We are pretty close with control hazards already…
• Why does prediction work?– Underlying algorithm has regularities.
– Data that is being operated on has regularities.
– Instruction sequence has redundancies that are artifacts of way that humans/compilers think about problems.
• Prediction => compressible information streams?
Dynamic Branch Prediction
Branch-Prediction Buffer (branch history table):
• The simplest thing to do with a branch is to predict whether or not it is taken.
• This helps in pipelines where the branch delay is longer than the time it takes to compute the possible target PCs.
– If we can save the decision time, we can branch sooner.
• Note that this scheme does NOT help with the MIPS we studied.
– Since the branch decision and target PC are computed in ID, assuming there is no hazard on the register tested.
Basic Idea
• Keep a buffer (cache) indexed by the lower portion of the address of the branch instruction.
– Along with some bit(s) to indicate whether or not the branch was recently taken or not.
• If the prediction is incorrect, the prediction bit is inverted and stored back.
• The branch direction could be incorrect because:
• Of misprediction OR
• Instruction mismatch
• In either case, the worst that happens is that you have to pay the full latency for the branch.
Need Address at Same Time as Prediction
• Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
– Note: must check for a branch-address match now, since we cannot use a wrong branch's target
• Return instruction addresses predicted with stack
[Figure: BTB lookup: the PC of the instruction in fetch indexes a table of (Branch PC, Predicted PC) pairs; an equality check (=?) on the branch PC validates the entry, which also says whether to predict taken or untaken.]
Dynamic Branch Prediction
[Figure: BTB organization: the PC of the instruction in fetch is looked up in a table of (Branch PC, Target PC, History) entries; a match (=?) means the instruction is predicted to be a branch and the entry supplies the predicted target; no match means the instruction is not predicted to be a branch.]
Add a Branch Target Buffer (BTB) that predicts (at fetch):
– whether the instruction is a branch
– the branch direction (taken / not-taken)
– the taken-branch target
BTB Allocation
Allocation
– Instructions identified as branches (after decode) are allocated
– Both conditional and unconditional branches are allocated
– Not-taken branches need not be allocated: a BTB miss implicitly predicts not-taken
Prediction
– The BTB lookup is done in parallel with the IC lookup
– The BTB provides: an indication that the instruction is a branch (BTB hit), the predicted branch target, the predicted branch direction, and the predicted branch type (e.g., conditional, unconditional)
Update (when the branch outcome is known)
– Branch target
– Branch history (taken / not-taken)
A small lookup/update sketch follows.
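A hedged sketch of the lookup/update behavior described above, with a direct-mapped BTB and a 1-bit direction; sizes and field choices are illustrative assumptions, not a specific processor's design:

#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 512

typedef struct {
    bool     valid;
    uint32_t tag;        /* branch PC bits, to detect instruction mismatch */
    uint32_t target;     /* predicted taken target                         */
    bool     taken;      /* predicted direction (1-bit history here)       */
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];

/* Lookup at fetch: a miss implicitly predicts "not a taken branch". */
bool btb_lookup(uint32_t pc, uint32_t *pred_target)
{
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (e->valid && e->tag == (pc >> 2) && e->taken) {
        *pred_target = e->target;
        return true;                 /* predict taken, redirect fetch */
    }
    return false;                    /* fall through to PC + 4        */
}

/* Update/allocate when the branch outcome is known (after decode/execute). */
void btb_update(uint32_t pc, bool taken, uint32_t target)
{
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
    e->valid  = true;
    e->tag    = pc >> 2;
    e->target = target;
    e->taken  = taken;
}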
BTB (cont.)
Wrong prediction
– Predicted not-taken, actually taken
– Predicted taken, actually not-taken, or actually taken but to the wrong target
In case of a wrong prediction, flush the pipeline
– Reset the latches (same as turning all in-flight instructions into NOPs)
– Select the PC source to be from the correct path (need to keep the fall-through address with the branch)
– Start fetching instructions from the correct path
Assuming a fraction P of branches is predicted correctly (with 20% of instructions being branches and a 3-cycle misprediction penalty):
CPI new = 1 + (0.2 × (1-P)) × 3
For example, if P = 0.7:
CPI new = 1 + (0.2 × 0.3) × 3 = 1.18
Dynamic Branch Prediction
• Performance = ƒ(accuracy, cost of misprediction)
• Branch History Table: Lower bits of PC address index table of 1-bit values
– Says whether or not branch taken last time
– No address check (saves HW, but may not be right branch)
• Problem: in a loop, a 1-bit BHT will cause 2 mispredictions (the average loop runs 9 iterations before exit):
– The end-of-loop case, when it exits instead of looping as before
– The first time through the loop on the next pass through the code, when it predicts exit instead of looping
– Only 80% accuracy even if the loop branch is taken 90% of the time
• Solution: 2-bit scheme where the prediction is changed only after two consecutive mispredictions:
• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to decision making process
Dynamic Branch Prediction (Jim Smith, 1981)
[Figure: the 2-bit saturating-counter state machine: two Predict Taken states and two Predict Not Taken states; a taken outcome (T) moves toward the strong taken state, a not-taken outcome (NT) moves toward the strong not-taken state, so the prediction flips only after two consecutive mispredictions.]
A small code sketch of this scheme follows.
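A small sketch of the 2-bit scheme in the state machine above (table size and PC hashing are illustrative assumptions):

#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 4096

static uint8_t bht[BHT_ENTRIES];   /* each entry is a 2-bit counter (0..3) */

/* Counters 0,1 predict not taken; 2,3 predict taken. */
bool predict(uint32_t pc)
{
    return bht[(pc >> 2) % BHT_ENTRIES] >= 2;
}

/* Train with the actual outcome; saturate at 0 and 3 so the prediction
 * changes only after two consecutive mispredictions. */
void train(uint32_t pc, bool taken)
{
    uint8_t *c = &bht[(pc >> 2) % BHT_ENTRIES];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}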
BHT Accuracy
• Mispredict because either:
– Wrong guess for that branch
– Got branch history of wrong branch when index the table
• With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• 4096 entries is about as good as an infinite table (in the Alpha 21164)
7 Branch Prediction Schemes
1. 1-bit Branch-Prediction Buffer
2. 2-bit Branch-Prediction Buffer
3. Correlating Branch Prediction Buffer
4. Tournament Branch Predictor
5. Branch Target Buffer
6. Integrated Instruction Fetch Units
7. Return Address Predictors
Accuracy v. Size (SPEC89)
[Figure: conditional branch misprediction rate (0% to 10%) vs. total predictor size (0 to 128 Kbits) for local, correlating, and tournament predictors.]
Special Case Return Addresses
• The target address of a register-indirect branch is hard to predict
• In SPEC89, 85% of such branches are procedure returns
• Since procedure calls and returns follow a stack discipline, save the return address in a small buffer that acts like a stack: 8 to 16 entries gives a small miss rate (a small sketch follows)
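A small sketch of such a return-address stack (the wrap-on-overflow behavior is an assumed detail, not taken from the slide):

#include <stdint.h>

#define RAS_DEPTH 16

static uint32_t ras[RAS_DEPTH];
static int ras_top;                       /* index of next free slot        */

void ras_push(uint32_t return_addr)       /* on a predicted call            */
{
    ras[ras_top % RAS_DEPTH] = return_addr;
    ras_top++;                            /* overflow wraps, oldest is lost */
}

uint32_t ras_pop(void)                    /* on a predicted return          */
{
    if (ras_top > 0) ras_top--;
    return ras[ras_top % RAS_DEPTH];      /* prediction; may be stale       */
}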
Adding a BTB to the Pipeline
[Figure: the basic 5-stage MIPS pipeline datapath (PC, instruction memory, register file, ALU, data memory, and the IF/ID, ID/EX, EX/MEM, MEM/WB latches) augmented with a BTB. The BTB is looked up with the fetch address in parallel with the instruction memory and supplies a predicted direction and a predicted target; a mux chooses between the predicted target and PC+4 (the not-taken target). A misprediction-detection unit later in the pipeline flushes the pipe on a wrong prediction, selects the PC from the correct path, and allocates/updates the BTB with the branch address, target, and direction.]
Using The BTB
IF stage: the PC moves to the next instruction; the instruction memory receives the PC and fetches the new instruction, while the BTB receives the PC and looks it up, both in parallel.
– On a BTB hit with a taken prediction: PC ← predicted address, and the IF/ID latch is loaded with the instruction from the predicted path.
– Otherwise (BTB miss, or a not-taken prediction): PC ← PC + 4, and the IF/ID latch is loaded with the sequential instruction.
Using The BTB (cont.)
EXE stage: if the instruction is a branch, calculate the branch condition and target and update the BTB.
– If the prediction was correct: continue through MEM and WB as usual.
– If not: flush the pipe, update the PC, and load the IF/ID latch with the correct instruction; then continue.
Dynamic Branch Prediction Summary
• Prediction becoming important part of scalar execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: Recently executed branches correlated with next branch.
– Either different branches
– Or different executions of same branches
• Tournament Predictor: more resources to competitive solutions and pick between them
• Branch Target Buffer: include branch address & prediction
• Predicated Execution can reduce number of branches, number of mispredicted branches
• Return address stack for prediction of indirect jump
Conclusion
• 1985-2000: 1000X performance – Moore’s Law transistors/chip => Moore’s Law for Performance/MPU
• Hennessy: the industry has been following a roadmap of ideas known in 1985 to exploit Instruction Level Parallelism and (real) Moore’s Law to get 1.55X/year
– Caches, Pipelining, Superscalar, Branch Prediction, Out-of-order execution, …
• ILP limits: To make performance progress in future need to have explicit parallelism from programmer vs. implicit parallelism of ILP exploited by compiler, HW?– Otherwise drop to old rate of 1.3X per year?– Less than 1.3X because of processor-memory performance gap?
• Impact on you: if you care about performance, better think about explicitly parallel algorithms vs. rely on ILP?
Example question
• Intel engineers ask you to compute by what percentage the CPU's performance will improve if the Branch Prediction unit improves its prediction accuracy by one percent and by two percent. To do so, answer the following questions. All calculations are based on these assumptions:
– One out of five (1:5) machine instructions is a conditional branch
– The cost of a wrong prediction is 12 stalls (assume there are no stalls other than those caused by wrong guesses)
– Before the improvement, the Branch Prediction Unit guesses correctly on 95% of the branches
• (4%) What is the percentage of stalls when the unit guesses correctly on 95% of the conditional branches (over a long run, what percentage of the clock cycles will be stalls)?
• (2%) What is the percentage of stalls when the unit guesses correctly on 96% of the conditional branches?
• (2%) What is the percentage of stalls when the unit guesses correctly on 97% of the conditional branches?
• Explain your calculations. What if a misprediction costs 25 stalls?
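One way to set up the first part (a sketch under the stated assumptions): with one conditional branch per five instructions and a 12-cycle misprediction cost, the stall cycles added per instruction at 95% accuracy are 0.2 × 0.05 × 12 = 0.12, so stalls make up 0.12 / (1 + 0.12) ≈ 10.7% of the clock cycles; repeating the calculation with 4% and 3% misprediction rates, and again with a 25-stall cost, covers the remaining parts.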
Virtually Addressed Caches
• On the MIPS R2000, the TLB translated the virtual address to a physical address before the address was sent to the cache => a physically addressed cache.
• Another approach is to send the virtual address directly to the cache => a virtually addressed cache
– Avoids translation if data is in the cache
– If data is not in the cache, the address is translated by a TLB/page table before going to main memory.
– Often requires larger tags
– Can result in aliasing, if two virtual addresses map to the same location in physical memory.
• With a virtually indexed cache, the tag gets translated into a physical address, while the rest of the address is accessing the cache.
Virtually Addressed Cache
Only requires address translation on a cache miss!
Synonym problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address!
– a nightmare for updates: must update all cache entries with the same physical address, or memory becomes inconsistent
– determining this requires significant hardware, essentially an associative lookup on the physical address tags to see if you have multiple hits
[Figure: virtually addressed cache: the CPU sends the virtual address (VA) to the cache; on a hit, data is returned directly; translation to a physical address (PA) and the access to main memory happen only on a miss.]
Memory Protection• With multiprogramming, a computer is shared by several
programs or processes running concurrently– Need to provide protection
– Need to allow sharing
• Mechanisms for providing protection– Provide both user and supervisor (operating system) modes
– Provide CPU state that the user can read, but cannot write
» user/supervisor bit, page table pointer, and TLB
– Provide method to go from user to supervisor mode and vice versa
» system call or exception : user to supervisor
» system or exception return : supervisor to user
– Provide permissions for each page in memory
– Store page tables in the operating system's address space, so they can't be accessed directly by the user.
Handling TLB Misses and Page Faults
• When a TLB miss occurs, either:
– the page is present in memory, and we just update the TLB
» this occurs if the valid bit of the page table entry is set
– the page is not present in memory, and the O.S. gets control to handle a page fault
• If a page fault occurs, the operating system:
– accesses the page table to determine the physical location of the page on disk
– chooses a physical page to replace; if the replaced page is dirty, it is written to disk
– reads the page from disk into the chosen physical page in main memory.
• Since the disk access takes so long, another process is typically allowed to run during a page fault.
Pitfall: Address space too small
• One of the biggest mistakes that can be made when designing an architecture is to devote too few bits to the address
– address size limits the size of virtual memory
– difficult to change since many components depend on it (e.g., PC, registers, effective-address calculations)
• As program size increases, larger and larger address sizes are needed
– 8 bit: Intel 8080 (1975)
– 16 bit: Intel 8086 (1978)
– 24 bit: Intel 80286 (1982)
– 32 bit: Intel 80386 (1985)
– 64 bit: Intel Merced (1998)
Virtual Memory Summary
• Virtual memory (VM) allows main memory (DRAM) to act like a cache for secondary storage (magnetic disk).
• Page tables and TLBs are used to translate the virtual address to a physical address
• The large miss penalty of virtual memory leads to different strategies from cache
– Fully associative
– LRU or LRU approximation
– Write-back
– Done by software