Cache Extras & Branch Prediction


Page 1:

Cache Extras & Branch Prediction

Page 2:

TLB and Cache Operation (See Figure 7.26 on page 594)

• On a memory access, the following operations occur (the flowchart on the next page walks through them).

Page 3:

Virtual Memory and Cache

[Flowchart: a Virtual Address goes to the TLB. TLB hit? No: access the Page Table, then retry. Yes: access the cache using the resulting Physical Address. Cache hit? No: access memory. Yes: return the Data.]

TLB access is serial with cache access.
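
A compilable sketch of this serial flow; all the helper functions below are hypothetical stand-ins for the hardware structures in the flowchart:

#include <stdint.h>
#include <stdbool.h>

extern bool     tlb_lookup(uint32_t va, uint32_t *pa);      /* TLB hit? */
extern uint32_t walk_page_table(uint32_t va);               /* TLB miss path */
extern void     tlb_insert(uint32_t va, uint32_t pa);
extern bool     cache_lookup(uint32_t pa, uint32_t *data);  /* cache hit? */
extern uint32_t memory_read(uint32_t pa);                   /* cache miss path */
extern void     cache_fill(uint32_t pa, uint32_t data);

uint32_t load(uint32_t va) {
    uint32_t pa, data;
    if (!tlb_lookup(va, &pa)) {         /* TLB miss: access the page table */
        pa = walk_page_table(va);
        tlb_insert(va, pa);
    }
    if (!cache_lookup(pa, &data)) {     /* cache miss: access memory */
        data = memory_read(pa);
        cache_fill(pa, data);
    }
    return data;    /* translation strictly precedes the cache access */
}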

Page 4:

Overlapped TLB & Cache Access

Virtual-memory view of a physical address: Physical page number (bits 29-12) | Page offset (bits 11-0)

Cache view of the same physical address: Tag | #Set | Disp

In this example the #Set field is not contained within the page offset: the #Set is not known until the physical page number is known, so the cache can be accessed only after the address translation is done.

Page 5:

Overlapped TLB & Cache Access (cont)

Virtual-memory view of a physical address: Physical page number (bits 29-12) | Page offset (bits 11-0)

Cache view of the same physical address: Tag | #Set | Disp

In this example the #Set field is contained within the page offset, so the #Set is known immediately and the cache can be accessed in parallel with the address translation: first the tags from the appropriate set are read out; the tag comparison takes place only after the physical page number is known (after the address translation is done).

Page 6:

Overlapped TLB & Cache Access (cont)

• First stage:
– The virtual page number goes to the TLB for translation.
– The page offset goes to the cache to read all the tags from the appropriate set.

• Second stage:
– The physical page number, obtained from the TLB, goes to the cache for tag comparison.

• Limitation: cache size ≤ (page size × associativity) (a worked example follows this list)

• How can we overcome this limitation?
– Fetch 2 sets and mux after translation
– Increase associativity
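
To see where the limit comes from, a worked check with assumed, typical numbers: with 4 KB pages the page offset is 12 bits, so at most 2^12 = 4 KB of cache can be indexed per way before the translation completes. A 4-way set-associative cache accessed in parallel with the TLB can therefore be at most 4 × 4 KB = 16 KB.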

Page 7:

Pseudo LRU

We will use a 4-way set-associative cache as an example. Full LRU records the full order of way accesses in each set (which way was most recently accessed, which was second, and so on). Pseudo LRU (PLRU) records only a partial order, using 3 bits per set:
– bit0 specifies whether the LRU way is one of ways 0 and 1, or one of ways 2 and 3
– bit1 specifies which of ways 0 and 1 was least recently used
– bit2 specifies which of ways 2 and 3 was least recently used

For example, if the order in which the ways were accessed is 3, 0, 2, 1, then bit0 = 1, bit1 = 1, bit2 = 1.

[Figure: a binary tree over the 4 ways, with bit0 at the root, bit1 over ways 0/1, and bit2 over ways 2/3.]
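
A minimal tree-PLRU sketch in C. The exact bit-encoding convention varies between designs (and may differ from the one the slide uses); here each bit is taken to point toward the less recently used half of its subtree, which is one consistent choice:

#include <stdint.h>

/* Tree-PLRU state for one 4-way set: bit0 = root, bit1 covers ways 0/1,
 * bit2 covers ways 2/3. Convention assumed here: bit0 = 1 means the LRU
 * way is in the 2/3 pair; bit1/bit2 = 1 select the higher-numbered way. */
typedef struct { uint8_t bit0, bit1, bit2; } plru4_t;

/* On an access to 'way', point the bits on its path away from it. */
static void plru4_touch(plru4_t *s, int way) {
    if (way < 2) {                /* accessed 0 or 1: LRU now in the 2/3 pair */
        s->bit0 = 1;
        s->bit1 = (way == 0);     /* the other way of the pair becomes its LRU */
    } else {                      /* accessed 2 or 3: LRU now in the 0/1 pair */
        s->bit0 = 0;
        s->bit2 = (way == 2);
    }
}

/* Follow the bits to the (pseudo-)least-recently-used way. */
static int plru4_victim(const plru4_t *s) {
    return s->bit0 ? (s->bit2 ? 3 : 2) : (s->bit1 ? 1 : 0);
}

Replaying the slide's access sequence 3, 0, 2, 1 with these conventions leaves bit0 = 1 and plru4_victim() returning 3, matching the intent (way 3 is the oldest) even where individual bit values differ from the slide's encoding.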

Page 8:

3Cs Absolute Miss Rate (SPEC92)

[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity; the miss rate is decomposed into conflict, capacity, and compulsory components. The compulsory component is vanishingly small.]

Page 9:

How Can We Reduce Misses?

3 Cs: Compulsory, Capacity, Conflict. In all cases, assume the total cache size is not changed. What happens if we:

1) change the block size: which of the 3Cs is obviously affected?

2) change the associativity: which of the 3Cs is obviously affected?

3) change the compiler: which of the 3Cs is obviously affected?

Page 10:

Reduce Misses via Larger Block Size

[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1 KB, 4 KB, 16 KB, 64 KB, and 256 KB.]

Page 11:

Reduce Misses via Higher Associativity

We have two conflicting trends here. Higher associativity:
– improves the hit ratio,
BUT
– increases the access time,
– slows down replacement,
– increases complexity.

Most modern cache memory systems use at least 4-way set-associative caches.

Page 12:

Example: Avg. Memory Access Time vs. Miss Rate

Example: assume a cache access time of 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, relative to the access time of a direct-mapped cache.

Cache Size (KB)   1-way   2-way   4-way   8-way
      1           2.33    2.15    2.07    2.01
      2           1.98    1.86    1.76    1.68
      4           1.72    1.67    1.61    1.53
      8           1.46    1.48    1.47    1.43
     16           1.29    1.32    1.32    1.32
     32           1.20    1.24    1.25    1.27
     64           1.14    1.20    1.21    1.23
    128           1.10    1.17    1.18    1.20

Effective access time to the cache (in the original slide, red marks entries that are not improved by more associativity). Note this is for a specific example.
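
For reference, each entry is an effective (average) access time of the form

AMAT = hit time + miss rate × miss penalty

so higher associativity wins only where its lower miss rate outweighs its longer hit time (1.10-1.14 above), which is why the small caches favor 8-way while the large caches favor direct mapped.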

Page 13:

Reducing Miss Penalty by Critical Word First

Don't wait for the full block to be loaded before restarting the CPU:
– Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
– Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in. Also called wrapped fetch or requested word first.

Example:
– 64-bit (8-byte) bus, 32-byte cache line, so 4 bus cycles fill a line.
– Fetching data at address 95H, the line's four chunks arrive in wrapped order: 90H-97H first, then 98H-9FH, then 80H-87H, then 88H-8FH.
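
A small sketch that reproduces the wrapped fill order for the example's parameters (the addresses and sizes are the ones assumed above):

#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE  32   /* bytes per cache line */
#define CHUNK_SIZE 8    /* bytes per bus transfer */

int main(void) {
    uint32_t miss_addr = 0x95;
    uint32_t line_base = miss_addr & ~(uint32_t)(LINE_SIZE - 1);      /* 0x80 */
    uint32_t first = (miss_addr & (LINE_SIZE - 1)) / CHUNK_SIZE;      /* chunk 2 */
    for (unsigned i = 0; i < LINE_SIZE / CHUNK_SIZE; i++) {
        unsigned chunk = ((unsigned)first + i) % (LINE_SIZE / CHUNK_SIZE);
        unsigned lo = (unsigned)line_base + chunk * CHUNK_SIZE;
        printf("bus cycle %u: %02XH-%02XH\n", i + 1, lo, lo + CHUNK_SIZE - 1);
    }
    return 0;   /* prints 90H-97H, 98H-9FH, 80H-87H, 88H-8FH */
}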

Page 14:

Prefetchers

In order to avoid compulsory misses, we need to bring the information into the cache before it is requested by the program. We can exploit the locality-of-reference behavior:
– Space -> bring in the surroundings of the accessed address.
– Time -> the same access "patterns" repeat themselves.

Prefetching relies on having extra memory bandwidth that can be used without penalty.

There are hardware and software prefetchers.

Page 15:

Hardware Prefetching

Instruction prefetching:
– The Alpha 21064 fetches 2 blocks on a miss; the extra block is placed in a stream buffer, to avoid possible cache pollution in case the prefetched instructions are never needed.
– On a miss, the stream buffer is checked.
– Branch-predictor-directed prefetching: let the branch predictor run ahead.

Data prefetching: try to predict future data accesses (see the sketch after this list):
– next sequential
– stride
– general pattern
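
A minimal stride-prefetcher sketch in C, of the reference-prediction-table kind; this is an illustrative design, not any particular CPU's, and the table size, indexing, and prefetch_line() hook are assumptions:

#include <stdint.h>

#define RPT_SIZE 64

typedef struct {
    uintptr_t tag;        /* PC of the load this entry tracks */
    uintptr_t last_addr;  /* previous data address from that load */
    intptr_t  stride;     /* last observed stride */
    int       confirmed;  /* same nonzero stride seen twice in a row */
} rpt_entry_t;

static rpt_entry_t rpt[RPT_SIZE];

extern void prefetch_line(uintptr_t addr);   /* hook into the memory system */

/* Called on every executed load with its PC and effective address. */
void rpt_observe(uintptr_t pc, uintptr_t addr) {
    rpt_entry_t *e = &rpt[(pc >> 2) % RPT_SIZE];
    if (e->tag != pc) {                      /* new load: (re)allocate entry */
        e->tag = pc; e->last_addr = addr; e->stride = 0; e->confirmed = 0;
        return;
    }
    intptr_t stride = (intptr_t)(addr - e->last_addr);
    e->confirmed = (stride != 0 && stride == e->stride);
    e->stride = stride;
    e->last_addr = addr;
    if (e->confirmed)
        prefetch_line(addr + (uintptr_t)stride);  /* predicted next access */
}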

Page 16:

Software Prefetching

Data prefetch:
– Load data into a register (HP PA-RISC loads)
– Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9)
– Special prefetching instructions cannot cause faults; this is a form of speculative execution.

How it is done:
– with a special prefetch intrinsic in the language
– automatically by the compiler

Issuing prefetch instructions takes time:
– Is the cost of issuing the prefetches < the savings in reduced misses?
– Wider superscalar issue reduces the difficulty of finding issue bandwidth for them.
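
For a concrete flavor, a sketch using GCC/Clang's __builtin_prefetch intrinsic (it compiles to the target's prefetch instruction, or to nothing if the target has none); the look-ahead distance is a tuning knob chosen here purely for illustration:

#include <stddef.h>

#define PREFETCH_AHEAD 16   /* elements to run ahead of the loop */

long sum(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)   /* hint: read-only, moderate locality */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 1);
        s += a[i];
    }
    return s;
}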

Page 17:

Reducing Misses by Compiler Optimizations

• McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software.

• Instructions:
– Reorder procedures in memory so as to reduce conflict misses
– Use profiling to look at conflicts (using tools they developed)

• Data:
– Merging arrays: improve spatial locality by using a single array of compound elements instead of 2 arrays
– Loop interchange: change the nesting of loops to access data in the order it is stored in memory
– Loop fusion: combine 2 independent loops that have the same looping and some overlapping variables
– Blocking: improve temporal locality by accessing "blocks" of data repeatedly instead of going down whole columns or rows

Page 18:

Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality.

Page 19:

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.

Page 20:

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

2 misses per access to a & c vs. one miss per access; improves temporal locality (a[i][j] and c[i][j] are reused while still in the cache).

Page 21:

Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }

• Two inner loops:
– read all N×N elements of z[]
– read N elements of 1 row of y[] repeatedly
– write N elements of 1 row of x[]

• Capacity misses are a function of N & the cache size:
– 2N³ + N² words accessed (assuming no conflicts; otherwise …)

• Idea: compute on a B×B submatrix that fits in the cache.

Page 22:

Blocking Example

/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B-1,N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B-1,N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }

• B is called the blocking factor.
• Capacity misses drop from 2N³ + N² to 2N³/B + 2N²
• Conflict misses too? (see the sizing note below)
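
On that last question, a rough sizing view (our gloss, not from the slide): the blocked loops keep a B×B block of z plus B-wide strips of y and x live at once, so B is chosen so that roughly 3B² elements fit in the cache. Even then, conflict misses can occur if the blocks map onto overlapping cache sets, which is why blocked codes sometimes copy the B×B block into a contiguous buffer first.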

Page 23:

Other techniques

Page 24:

Multi-ported Cache and Banked Cache

An n-ported cache enables n cache accesses in parallel:
– parallelize cache accesses in different pipeline stages
– parallelize cache accesses in a superscalar processor

But a second port effectively doubles the cache die size. A possible solution is a banked cache:
– each line is divided into n banks
– data can be fetched from k ≤ n different banks (in possibly different lines)

Page 25:

Separate Code / Data Caches

Enables parallelism between data accesses (done in the memory-access stage) and instruction fetches (done in the fetch stage of a pipelined processor).

The code cache is a read-only cache:
– no need to write the line back to memory when it is evicted
– simpler to manage

What about self-modifying code? (x86 only)
– Whenever a memory write executes, the code cache must be snooped.
– If the code cache contains the written address, the line containing that address is invalidated.
– The code cache is then accessed both in the fetch stage and in the memory-access stage, so the tags need to be dual-ported to avoid stalling.

Page 26:

Increasing the Size with Minimum Latency Loss: the L2 Cache

L2 is much larger than L1 (256 KB-1 MB, compared to 32-64 KB).

L2 used to be an off-chip cache (between the cache and the memory bus). Now most implementations are on-chip (but some architectures have an off-chip level-3 cache).
– If L2 is on-chip, why not just make L1 larger?

L2 can be inclusive:
– all addresses in L1 are also contained in L2
– data in L1 may be more up to date than in L2
– L2 is unified (code / data)

Most architectures do not require the caches to be inclusive (although, due to the size difference, they usually are).

Page 27:

Victim Cache

Problem: the per-set load may be non-uniform:
– some sets may have more conflict misses than others

Solution: allocate ways to sets dynamically, according to the load. When a line is evicted from the cache, it is placed in the victim cache:
– if the victim cache is full, its LRU line is evicted to L2 to make room for the new victim line from L1

On a cache lookup, a victim-cache lookup is also performed (in parallel). On a victim-cache hit:
– the line is moved back into the cache
– the evicted line is moved into the victim cache
– the access time is the same as a cache hit

Especially effective for a direct-mapped cache:
– it combines the fast hit time of a direct-mapped cache with reduced conflict misses.

Page 28:

Stream Buffers

Before inserting a new line into the cache, put it in a stream buffer. A line is moved from the stream buffer into the cache only if we get some indication that the line will be accessed again in the future.

Example:
– Assume we scan a very large array (much larger than the cache), accessing each item just once.
– If we insert the array into the cache, it will thrash the entire cache.
– If we detect that this is just a scan-once operation (e.g., using a hint from the software), we can avoid putting the array's lines into the cache.

Page 29:

Virtual Memory and Process Switch

• Whenever we switch execution between processes we need to:
– save the state of the outgoing process
– load the state of the new process
– make sure that the P0BPT and P1BPT registers point to the new process's translation tables
– flush the TLB
– We do not need to flush the cache. (Why?)

Page 31:

Prediction: Branches, Dependencies, Data

A new era in computing?
• Prediction has become essential to getting good performance from scalar instruction streams.
• We will discuss predicting branches, data dependencies, actual data, and results of groups of instructions:
– At what point does computation become a probabilistic operation + verification?
– We are pretty close with control hazards already…
• Why does prediction work?
– The underlying algorithm has regularities.
– The data being operated on has regularities.
– The instruction sequence has redundancies that are artifacts of the way humans/compilers think about problems.
• Prediction => compressible information streams?

Page 32:

Dynamic Branch Prediction
Branch-prediction buffer (branch history table):
• The simplest thing to do with a branch is to predict whether or not it is taken.
• This helps in pipelines where the branch delay is longer than the time it takes to compute the possible target PCs.
– If we can save the decision time, we can branch sooner.
• Note that this scheme does NOT help with the MIPS pipeline we studied, since the branch decision and target PC are computed in ID, assuming there is no hazard on the register being tested.

Page 33:

Basic Idea

• Keep a buffer (a cache) indexed by the lower portion of the address of the branch instruction, along with some bit(s) indicating whether or not the branch was recently taken.
• If the prediction is incorrect, the prediction bit is inverted and stored back.
• The predicted direction could be wrong because of:
– a misprediction, OR
– an instruction mismatch (the entry belongs to a different branch that indexes to the same slot)
• In either case, the worst that happens is that you pay the full latency for the branch.

Page 34:

Need Address at the Same Time as the Prediction

• Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken).
– Note: we must now check for a branch match, since we can't use the wrong branch's address.
• Return instruction addresses are predicted with a stack.

[Figure: the PC of the instruction in fetch is compared (=?) against the Branch PC field of each BTB entry; on a match the entry supplies the Predicted PC and whether the branch is predicted taken or untaken.]

Page 35:

Dynamic Branch Prediction

[Figure: the PC of the instruction in fetch is looked up (=?) in a table of (Branch PC, Target PC, History) entries. A hit means the instruction is predicted to be a branch, and the Predicted Target is supplied; a miss means the instruction is not predicted to be a branch.]

Add a Branch Target Buffer (BTB) that predicts (at fetch):
– whether the instruction is a branch
– whether the branch is taken / not-taken
– the taken-branch target

Page 36:

BTB Allocation

Allocation:
– Instructions identified as branches (after decode) are allocated.
– Both conditional and unconditional branches are allocated.
– Not-taken branches need not be allocated: a BTB miss implicitly predicts not-taken.

Prediction:
– The BTB lookup is done in parallel with the instruction-cache lookup.
– The BTB provides:
• an indication that the instruction is a branch (the BTB hits)
• the branch's predicted target
• the branch's predicted direction
• the branch's predicted type (e.g., conditional, unconditional)

Update (when the branch outcome is known):
– the branch target
– the branch history (taken / not-taken)

Page 37:

BTB (cont.)

Wrong predictions:
– predicted not-taken, actually taken
– predicted taken, actually not-taken, or actually taken but to the wrong target

In case of a wrong prediction, flush the pipeline:
– reset the latches (same as turning all in-flight instructions into NOPs)
– select the PC source to be from the correct path (we need to keep the fall-through address with the branch)
– start fetching instructions from the correct path

Assuming a fraction P of branches is predicted correctly (with 20% of instructions being branches and a 3-cycle misprediction penalty):

CPI_new = 1 + (0.2 × (1 − P)) × 3

For example, if P = 0.7:

CPI_new = 1 + (0.2 × 0.3) × 3 = 1.18

Page 38:

Dynamic Branch Prediction

• Performance = ƒ(accuracy, cost of misprediction)
• Branch History Table: the lower bits of the PC index a table of 1-bit values:
– the bit says whether or not the branch was taken last time
– there is no address check (saves hardware, but the entry may not belong to the right branch)
• Problem: in a loop, a 1-bit BHT causes 2 mispredictions per loop execution (the average loop runs 9 iterations before exiting):
– at the end of the loop, when it exits instead of looping as before
– on the first iteration of the next execution of the loop, when it predicts exit instead of looping
– only 80% accuracy, even though the branch is taken 90% of the time

Page 39:

Dynamic Branch Prediction (Jim Smith, 1981)

• Solution: a 2-bit scheme where we change the prediction only after mispredicting twice:
• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to the decision-making process

[Figure: a 2-bit saturating state machine with four states, two "Predict Taken" and two "Predict Not Taken"; T/NT edges move between neighboring states, so it takes two consecutive mispredictions to flip the prediction.]
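
A compilable sketch of such a 2-bit branch history table (the 4096-entry size matches the table discussed on the next slide; the indexing is an illustrative choice):

#include <stdint.h>
#include <stdbool.h>

#define BHT_ENTRIES 4096

static uint8_t bht[BHT_ENTRIES];            /* each entry: 2-bit counter, 0..3 */

static unsigned bht_index(uint32_t pc) {
    return (pc >> 2) & (BHT_ENTRIES - 1);   /* low PC bits, word-aligned */
}

bool bht_predict(uint32_t pc) {
    return bht[bht_index(pc)] >= 2;         /* 2,3 = predict taken; 0,1 = not */
}

void bht_update(uint32_t pc, bool taken) {
    uint8_t *c = &bht[bht_index(pc)];
    if (taken)  { if (*c < 3) (*c)++; }     /* saturate at strongly taken */
    else        { if (*c > 0) (*c)--; }     /* saturate at strongly not taken */
}

Because the counter saturates, a loop branch that mispredicts once at loop exit stays in the "taken" half and is predicted correctly again on the next execution of the loop, fixing the 1-bit scheme's double misprediction.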

Page 40:

BHT Accuracy

• Mispredictions occur because either:
– the guess for that branch was wrong, or
– the table entry indexed belongs to a different branch
• With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%.
• 4096 entries is about as good as an infinite table (measured on the Alpha 21164).

Page 41:

7 Branch Prediction Schemes

1. 1-bit Branch-Prediction Buffer

2. 2-bit Branch-Prediction Buffer

3. Correlating Branch Prediction Buffer

4. Tournament Branch Predictor

5. Branch Target Buffer

6. Integrated Instruction Fetch Units

7. Return Address Predictors

Page 42:

Accuracy vs. Size (SPEC89)

[Figure: conditional-branch misprediction rate (0% to 10%) vs. total predictor size (0 to 128 Kbits) for local, correlating, and tournament predictors.]

Page 43:

Special Case: Return Addresses

• Register-indirect branches are hard to predict the target address of.
• In SPEC89, 85% of such branches are procedure returns.
• Since procedure calls and returns follow a stack discipline, save return addresses in a small buffer that acts like a stack: 8 to 16 entries give a small miss rate.
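
A minimal return-address-stack sketch with the 16 entries mentioned above (the circular-overwrite policy on overflow is one common, assumed choice):

#include <stdint.h>

#define RAS_ENTRIES 16

static uint32_t ras[RAS_ENTRIES];
static unsigned ras_top;                    /* monotonically counts pushes */

void ras_push(uint32_t return_addr) {       /* at a call: push PC + 4 */
    ras[ras_top % RAS_ENTRIES] = return_addr;
    ras_top++;                              /* overflow silently overwrites */
}

uint32_t ras_predict_return(void) {         /* at a return: pop the prediction */
    ras_top--;                              /* unsigned wraparound is harmless */
    return ras[ras_top % RAS_ENTRIES];      /* since 16 divides 2^32 */
}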

Page 44:

Adding a BTB to the Pipeline

[Figure: the basic MIPS pipeline datapath (PC, instruction memory, IF/ID, ID/EX, EX/MEM, MEM/WB latches, register file, ALU, data memory) extended with a BTB looked up with the fetch address, in parallel with the instruction memory. (1) The BTB supplies a predicted direction and predicted target; (2) a mux selects the next PC between PC+4 (the not-taken target) and the predicted taken target; (3) a misprediction-detection unit compares the resolved branch outcome against the predicted direction and target, and on a mispredict asserts Flush, redirects the PC, and allocates/updates the BTB entry (address, target, direction).]

Page 45:

Using the BTB

[Flowchart, IF stage: the PC moves to the next instruction; the instruction memory fetches the new instruction while, in parallel, the BTB looks up the PC. BTB hit? If no, PC ← PC + 4 and the IF/ID latch is loaded with the new instruction. If yes: is the branch predicted taken? If no, PC ← PC + 4 and the IF/ID latch is loaded with the sequential instruction; if yes, PC ← predicted address and the IF/ID latch is loaded with the predicted instruction. Resolution continues in ID/EXE.]

Page 46:

Using the BTB (cont.)

[Flowchart, ID/EXE/MEM/WB stages: is the instruction a branch? If no, continue. If yes, calculate the branch condition & target. Was the prediction correct? If yes, continue and update the BTB. If no, flush the pipe & update the PC, load the IF/ID latch with the correct instruction, and update the BTB.]

Page 47:

Dynamic Branch Prediction Summary

• Prediction is becoming an important part of scalar execution.
• Branch History Table: 2 bits for loop accuracy.
• Correlation: recently executed branches are correlated with the next branch:
– either different branches
– or different executions of the same branch
• Tournament predictor: give more resources to competing predictors and pick between them.
• Branch Target Buffer: include the branch target address & the prediction.
• Predicated execution can reduce the number of branches, and hence the number of mispredicted branches.
• A return-address stack predicts indirect jumps used for returns.

Page 48:

Conclusion

• 1985-2000: 1000× performance
– Moore's Law for transistors/chip => a "Moore's Law" for MPU performance
• Hennessy: industry has been following a roadmap of ideas known in 1985, exploiting instruction-level parallelism and (real) Moore's Law to get 1.55×/year:
– caches, pipelining, superscalar, branch prediction, out-of-order execution, …
• ILP limits: to keep making performance progress, will we need explicit parallelism from the programmer, vs. the implicit parallelism of ILP exploited by the compiler and hardware?
– Otherwise, do we drop to the old rate of 1.3× per year?
– Or less than 1.3×, because of the processor-memory performance gap?
• Impact on you: if you care about performance, better to think about explicitly parallel algorithms than to rely on ILP.

Page 49:

Example Question

• Intel's engineers ask you to compute by what percentage the CPU's efficiency would improve if the branch prediction unit improved its prediction accuracy by one percent and by two percent. To do so, answer the questions below. All calculations are based on the following assumptions:
– One in five (1:5) machine instructions is a conditional branch.
– The cost of a misprediction is 12 stalls. (Assume there are no stalls other than those caused by mispredictions.)
– Before the improvement, the branch prediction unit predicts 95% of the branches correctly.

• (4%) What is the percentage of stalls when the unit predicts 95% of the conditional branches correctly (over a long run, what percentage of the clock cycles are stalls)?

• (2%) What is the percentage of stalls when the unit predicts 96% of the conditional branches correctly?

• (2%) What is the percentage of stalls when the unit predicts 97% of the conditional branches correctly?

• Explain your calculations. What if a misprediction costs 25 stalls?

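One way to set up the calculation (our gloss, not part of the original question): each instruction contributes on average 0.2 × (1 − P) × 12 stall cycles, so over a long run the fraction of clock cycles that are stalls is (0.2 × (1 − P) × 12) / (1 + 0.2 × (1 − P) × 12); for P = 0.95 this gives 0.12 / 1.12 ≈ 10.7%.
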
Page 50:

Virtually Addressed Caches

• On the MIPS R2000, the TLB translates the virtual address to a physical address before the address is sent to the cache => a physically addressed cache.
• Another approach is to send the virtual address directly to the cache => a virtually addressed cache:
– avoids translation if the data is in the cache
– if the data is not in the cache, the address is translated by the TLB/page table before going to main memory
– often requires larger tags
– can result in aliasing, if two virtual addresses map to the same location in physical memory
• With a virtually indexed cache, the tag portion of the address is translated while the rest of the address accesses the cache.

Page 51:

Virtually Addressed Cache

Only requires address translation on a cache miss!

The synonym problem: two different virtual addresses map to the same physical address => two different cache entries hold data for the same physical address!
– a nightmare for updates: all cache entries with the same physical address must be updated, or memory becomes inconsistent
– detecting this requires significant hardware: essentially an associative lookup on the physical-address tags to see if there are multiple hits

[Figure: the CPU sends the virtual address (VA) to the cache; on a hit, data returns directly; only on a miss is the VA translated to a physical address (PA) that goes to main memory.]

Page 52:

Memory Protection

• With multiprogramming, a computer is shared by several programs or processes running concurrently:
– need to provide protection
– need to allow sharing
• Mechanisms for providing protection:
– provide both user and supervisor (operating system) modes
– provide CPU state that the user can read but cannot write
» the user/supervisor bit, the page table pointer, and the TLB
– provide a method to go from user to supervisor mode and vice versa
» system call or exception: user to supervisor
» system or exception return: supervisor to user
– provide permissions for each page in memory
– store the page tables in the operating system's address space, so they can't be accessed directly by the user

Page 53:

Handling TLB Misses and Page Faults

• When a TLB miss occurs, either:
– the page is present in memory, and we just update the TLB
» this occurs if the valid bit of the page-table entry is set
– the page is not present in memory, and the O.S. gets control to handle a page fault
• If a page fault occurs, the operating system:
– accesses the page table to determine the location of the page on disk
– chooses a physical page to replace; if the replaced page is dirty, it is written to disk
– reads the page from disk into the chosen physical page in main memory
• Since the disk access takes so long, another process is typically allowed to run during a page fault. A sketch of this flow follows.

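A compilable sketch of the flow just described; every type and helper below is hypothetical (no real OS API is being quoted):

#include <stdint.h>
#include <stdbool.h>

typedef struct { bool valid; bool dirty; uint32_t frame; uint32_t disk_loc; } pte_t;

extern pte_t   *page_table_entry(uint32_t va);
extern void     tlb_refill(uint32_t va, const pte_t *pte);
extern uint32_t choose_victim_frame(pte_t **victim_pte);
extern void     write_page_to_disk(uint32_t frame, uint32_t disk_loc);
extern void     read_page_from_disk(uint32_t disk_loc, uint32_t frame);
extern void     schedule_another_process(void);

void handle_tlb_miss(uint32_t va) {
    pte_t *pte = page_table_entry(va);
    if (pte->valid) {                    /* page present: just update the TLB */
        tlb_refill(va, pte);
        return;
    }
    /* Page fault: the O.S. takes over. */
    pte_t *victim_pte;
    uint32_t frame = choose_victim_frame(&victim_pte);
    if (victim_pte->dirty)                       /* replaced page is dirty */
        write_page_to_disk(frame, victim_pte->disk_loc);
    victim_pte->valid = false;
    schedule_another_process();                  /* disk is slow: run someone else */
    read_page_from_disk(pte->disk_loc, frame);   /* page's location on disk */
    pte->frame = frame;
    pte->valid = true;
    tlb_refill(va, pte);
}
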
Page 54:

Pitfall: Address Space Too Small

• One of the biggest mistakes that can be made when designing an architecture is to devote too few bits to the address:
– the address size limits the size of virtual memory
– it is difficult to change, since many components depend on it (e.g., the PC, registers, effective-address calculations)
• As program sizes increase, larger and larger address sizes are needed:
– 8-bit: Intel 8080 (1975)
– 16-bit: Intel 8086 (1978)
– 24-bit: Intel 80286 (1982)
– 32-bit: Intel 80386 (1985)
– 64-bit: Intel Merced (1998)

Page 55:

Virtual Memory Summary

• Virtual memory (VM) allows main memory (DRAM) to act like a cache for secondary storage (magnetic disk).
• Page tables and TLBs are used to translate virtual addresses to physical addresses.
• The large miss penalty of virtual memory leads to different strategies than for caches:
– fully associative placement
– LRU (or an LRU approximation) replacement
– write-back
– miss handling done by software