Advanced Microarchitecture
Lecture 12: Caches and Memory
Slide 2: SRAM {Over|Re}view

• Chained inverters maintain a stable state
• Access gates provide access to the cell
• Writing to a cell involves over-powering the two small storage inverters

[Figure: the "6T SRAM" cell: two cross-coupled inverters (2T per inverter) holding complementary values (0/1), plus 2 access gates connecting the cell to the bitlines b and b̄]
Slide 3: 64×1-bit SRAM Array Organization

[Figure: an 8×8 SRAM array; a 1-of-8 row decoder drives the "wordlines", and a second 1-of-8 decoder selects among the "bitlines" through the "column mux"]

Why are we reading both b and b̄?
Slide 4: SRAM Density vs. Speed

• 6T cell must be as small as possible to have dense storage
  – Bigger caches
  – Smaller transistors → slower transistors

[Figure: the bitline is a *long* metal line with a lot of parasitic loading, so the dinky cell inverters cannot drive their outputs very quickly…]
Slide 5: Sense Amplifiers

• Type of differential amplifier
  – Two inputs X and Y; amplifies the difference: output = a × (X − Y) + Vbias

[Figure: bitlines b and b̄ are precharged to Vdd; when the wordline is enabled, the small cell discharges its bitline very slowly, but the sense amp "sees" the difference quickly and outputs b's value]

• Sometimes precharge bitlines to Vdd/2, which makes a bigger "delta" for faster sensing
Slide 6: Multi-Porting

[Figure: a dual-ported SRAM cell with two wordlines (Wordline1, Wordline2) and two bitline pairs (b1/b̄1, b2/b̄2)]

• Wordlines = 2 × ports
• Bitlines = 4 × ports
• Area = O(ports²): each added port grows both the cell's height (wordlines) and its width (bitlines), so area grows with their product
Slide 7: Port Requirements

• ARF, PRF, RAT all need many read and write ports to support superscalar execution
  – Luckily, these have a limited number of entries/bytes
• Caches also need multiple ports
  – Not as many ports
  – But the overall size is much larger
Slide 8: Delay Of Regular Caches

• I$
  – Low port requirement (one fetch group/$-line per cycle)
  – Latency only exposed on branch mispredict
• D$
  – Higher port requirement (multiple LD/ST per cycle)
  – Latency often on critical path of execution
• L2
  – Lower port requirement (most accesses hit in L1)
  – Latency less important (only observed on L1 miss)
  – Optimizing for hit rate usually more important than latency
    • The difference between L2 latency and DRAM latency is large
Slide 9: Banking

[Figure: a big 4-ported L1 data cache (four decoders, sense amps, and column muxes around one SRAM array) is slow due to quadratic area growth; splitting it into 4 banks of 1 port each (each bank with its own decoder, SRAM array, and sense amps) makes each bank much faster]
Slide 10: Bank Conflicts

• Banking provides high bandwidth, but only if all accesses are to different banks
• Banks are typically address-interleaved
  – For N banks: Addr → bank[Addr % N], with Addr taken at cache-line granularity
  – For 4 banks and 2 accesses, the chance of a conflict is 25% (assuming independent, uniformly distributed accesses, the second access lands in the first one's bank with probability 1/4)
  – Need to match # banks to access patterns/BW (see the sketch below)
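A minimal C sketch of this interleaving, assuming 64-byte lines and a power-of-two bank count; the names and constants are illustrative, not from the lecture:

    #include <stdint.h>

    #define LINE_BYTES 64u   /* assumed cache-line size */
    #define NUM_BANKS   4u   /* assumed bank count */

    /* Interleave at cache-line granularity: consecutive lines
       map to consecutive banks. */
    static inline unsigned bank_of(uint64_t addr) {
        uint64_t line = addr / LINE_BYTES;
        return (unsigned)(line % NUM_BANKS);
    }

    /* Two simultaneous accesses conflict iff they map to the same bank. */
    static inline int bank_conflict(uint64_t a, uint64_t b) {
        return bank_of(a) == bank_of(b);
    }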
Slide 11: Associativity

• You should know this already

[Figure: looking up foo's value in three organizations: direct mapped (a RAM indexed by foo's address, one possible location), fully associative (a CAM searched with foo, any location), and set associative (a CAM/RAM hybrid: index a set, then search within it)]
Slide 12: Set-Associative Caches

• Set-associativity is good for reducing conflict misses
• Cost: slower cache access
  – Often dominated by the tag array comparisons (basically mini-CAM logic: one 40-50 bit comparison per way!)
• Must trade off:
  – Smaller cache size
  – Longer latency
  – Lower associativity
• Every option hurts performance (a lookup sketch follows)
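To make the per-way comparisons concrete, here is a minimal C sketch of a 4-way tag check; the set count and bit widths are assumptions for illustration:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_WAYS    4
    #define NUM_SETS  128         /* assumed */
    #define LINE_BITS   6         /* assumed 64-byte lines */
    #define SET_BITS    7         /* log2(NUM_SETS) */

    typedef struct { uint64_t tag; bool valid; } tag_entry;

    tag_entry tags[NUM_SETS][NUM_WAYS];

    /* Returns the hitting way, or -1 on a miss.  In hardware all
       NUM_WAYS comparators fire in parallel; the loop models that. */
    int lookup(uint64_t addr) {
        uint64_t set = (addr >> LINE_BITS) & (NUM_SETS - 1);
        uint64_t tag = addr >> (LINE_BITS + SET_BITS);  /* the wide 40-50 bit tag */
        for (int way = 0; way < NUM_WAYS; way++)
            if (tags[set][way].valid && tags[set][way].tag == tag)
                return way;
        return -1;
    }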
Slide 13: Way-Prediction

• If figuring out the way takes too long, then just guess!

[Figure: the load PC indexes a way predictor, and the predicted way's payload is read out speculatively; the tag check (the per-way comparators) still occurs to validate the way-prediction]

• May be hard to predict the way if the same load accesses different addresses (sketch below)
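One possible shape for such a predictor, sketched in C as a generic PC-indexed table; the table size and indexing are assumptions, not the lecture's design:

    #include <stdint.h>

    #define PRED_ENTRIES 1024   /* assumed table size */

    static uint8_t way_pred[PRED_ENTRIES];   /* last way that hit, per load PC */

    static inline unsigned pred_index(uint64_t load_pc) {
        return (unsigned)((load_pc >> 2) % PRED_ENTRIES);
    }

    /* Guess the way before the tag comparison completes... */
    unsigned predict_way(uint64_t load_pc) {
        return way_pred[pred_index(load_pc)];
    }

    /* ...and train on the way the (slower) tag check actually found. */
    void train_way(uint64_t load_pc, unsigned actual_way) {
        way_pred[pred_index(load_pc)] = (uint8_t)actual_way;
    }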
Slide 14: Way-Prediction (2)

• Organize the data array s.t. the left-most way is the MRU

[Figure: blocks ordered MRU→LRU across the ways; way-predicting the MRU way keeps hitting, and on a way-miss (still a cache hit) the block is moved to the MRU position so way-prediction continues to hit]

• Complication: the data array needs a datapath for swapping blocks (maybe 100's of bits); normally you just update a few LRU bits in the tag array (< 10 bits?)
Slide 15: Partial Tagging

• Like BTBs, just use part of the tag
  – Tag array lookup is now much faster!
• Partial tags lead to false hits: tag 0x45120001 looks like a hit for address 0x3B120001
• Similar to way-prediction, a full tag comparison is still needed to verify a "real" hit, but it is not on the critical path (sketch below)
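A minimal sketch of the two-step check in C, assuming (illustratively) that only the low 16 bits of the tag are compared on the fast path:

    #include <stdint.h>
    #include <stdbool.h>

    #define PARTIAL_MASK 0xFFFFu   /* assumed partial-tag width: 16 bits */

    /* Fast path: narrow comparison; may report false hits. */
    static inline bool partial_match(uint64_t stored_tag, uint64_t addr_tag) {
        return (stored_tag & PARTIAL_MASK) == (addr_tag & PARTIAL_MASK);
    }

    /* Slow path, off the critical path: the full comparison catches
       false hits such as 0x45120001 vs. 0x3B120001, which agree in
       the low bits but differ in the full tag. */
    static inline bool full_match(uint64_t stored_tag, uint64_t addr_tag) {
        return stored_tag == addr_tag;
    }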
Slide 16: … in the LSQ

• Partial tagging can be used in the LSQ as well
  – Do the address check on partial addresses only
  – On a partial hit, forward the data; a slower complete tag check verifies the match/no-match, and we replay or flush as needed
• If a store finds a later partially-matched load, don't do a pipeline flush right away
  – The penalty is too severe; wait for the slow check before flushing the pipe
Slide 17: Interaction With Scheduling

• Bank conflicts, way-mispredictions, and partial-tag false hits all change the latency of the load instruction
  – Increases the frequency of replays (more "replay conditions" exist/are encountered)
  – Need a careful tradeoff between:
    • performance (reducing effective cache latency)
    • performance (frequency of replaying instructions)
    • power (frequency of replaying instructions)
Slide 18: Alternatives to Adding Associativity

• More set-associativity is needed when the number of items mapping to the same cache set > the number of ways
• Not all sets suffer from high conflict rates
• Idea: provide a little extra associativity, but not for each and every set
Slide 19: Victim Cache

[Figure: access streams ABCDE and JKLMN hitting a 4-way set-associative cache: every access is a miss, because five blocks do not "fit" in a 4-way set (E evicts A, then A evicts B, and so on). A small victim cache beside the main cache catches each evicted block, providing a "fifth way" so long as only four sets overflow into it at the same time; it can even provide 6th or 7th … ways. The figure also shows a third stream, PQR, competing for the same victim-cache capacity]
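A tiny C sketch of the mechanism, modeling the fully-associative victim buffer as a small array; the size, FIFO replacement, and omission of valid bits are simplifying assumptions:

    #include <stdint.h>
    #include <stdbool.h>

    #define VICTIM_LINES 4   /* assumed victim-cache size */

    static uint64_t victim[VICTIM_LINES];   /* evicted line addresses */
    static unsigned v_next;                 /* FIFO replacement, for brevity */

    /* A line evicted from a conflicting L1 set parks here. */
    void on_l1_evict(uint64_t line) {
        victim[v_next++ % VICTIM_LINES] = line;
    }

    /* Probed on an L1 miss; a hit means the "extra way" saved us and
       the line can be swapped back into L1 instead of refetched. */
    bool victim_hit(uint64_t line) {
        for (unsigned i = 0; i < VICTIM_LINES; i++)
            if (victim[i] == line)
                return true;
        return false;
    }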
Slide 20: Skewed Associativity

[Figure: blocks ABCD and WXYZ all map to the same sets of a regular set-associative cache, causing lots of misses; in a skewed-associative cache, each way indexes the array with a different hash of the address, so the same blocks (A/X, Y/C, B/D, W/Z) spread across different sets and there are fewer misses]
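A sketch of per-way skewed indexing in C; the XOR-based hash below is an illustrative stand-in (skewed caches in the literature use XOR-style mappings), not the lecture's exact function:

    #include <stdint.h>

    #define SET_BITS 7                    /* assumed: 128 sets per way */
    #define NUM_SETS (1u << SET_BITS)

    /* Each way hashes the line address differently, so two lines that
       conflict in one way usually land in different sets of another. */
    static inline unsigned skew_index(uint64_t line_addr, unsigned way) {
        uint64_t h = line_addr ^ (line_addr >> (SET_BITS * (way + 1)));
        return (unsigned)(h & (NUM_SETS - 1));
    }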
Slide 21: Required Associativity Varies

• The program stack needs very little associativity
  – Spatial locality:
    • the stack frame is laid out sequentially
    • a function usually only refers to its own stack frame

[Figure: a call stack of frames f(), g(), h(), j(), k() with addresses laid out in a linear organization; in a 4-way cache the frames simply fill the ways MRU→LRU, so the associativity is not being used effectively]
Slide 22: Stack Cache

[Figure: the "nice" stack accesses (frames f(), g(), h(), j(), k()) are routed to a separate stack cache, while the disorganized heap accesses go to the "regular" cache; mixing both in one cache causes lots of conflicts]
Slide 23: Stack Cache (2)

• The stack-cache portion can be a lot simpler due to its direct-mapped structure
  – Relatively easy to prefetch for by monitoring call/retn's
• The "regular" cache portion can have lower associativity
  – It doesn't have conflicts due to stack/heap interaction
Slide 24: Stack Cache (3)

• Which cache does a load access?
  – Many ISAs have a "default" stack-pointer register

[Figure: LDQ 0[$sp], LDQ 12[$sp], and LDQ 24[$sp] route to the stack cache, while LDQ 0[$t1] routes to the regular cache. But after MOV $t3 = $sp, LDQ 8[$t3] is really a stack access that the $sp heuristic misroutes: you need stack base and offset information and must check each cache access against these bounds, because a wrong-cache access → replay (a bounds-check sketch follows)]
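A minimal C sketch of such a routing check, assuming a downward-growing stack and hypothetical stack_limit/stack_base bounds registers:

    #include <stdint.h>
    #include <stdbool.h>

    /* Route a load by comparing its effective address against the
       current stack bounds (stack occupies [stack_limit, stack_base)). */
    static inline bool routes_to_stack_cache(uint64_t ea,
                                             uint64_t stack_limit,
                                             uint64_t stack_base) {
        return ea >= stack_limit && ea < stack_base;
    }

An address-based check like this catches stack accesses made through a copied pointer (like $t3 above) that a register-name heuristic would misroute.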
Slide 25: Multi-Lateral Caches

• A normal cache is "uni-lateral" in that everything goes into the same place
• The stack cache is an example of a "multi-lateral" cache
  – Multiple cache structures with disjoint contents
  – I$ vs. D$ could be considered multi-lateral
Slide 26: Access Patterns

• The stack cache showed how different loads exhibit different access patterns:
  – Stack: multiple push/pop's of frames
  – Heap: heavily data-dependent access patterns
  – Streaming: linear accesses with low/no reuse
Slide 27: Low-Reuse Accesses

• Streaming
  – Once you're done decoding an MPEG frame, there is no need to revisit it
• Other, e.g.:

    struct tree_t {
        int valid;
        int other_fields[24];   /* fields map to different cache lines */
        int num_children;
        struct tree_t * children;
    };

    while (some condition) {
        struct tree_t * parent = getNextRoot(…);
        if (parent->valid) {
            doTreeTraversalStuff(parent);
            doMoreStuffToTree(parent);
            pickFruitFromTree(parent);
        }
    }

  – parent->valid is accessed once, and then not used again
Slide 28: Filter Caches

• Several proposed variations
  – Annex cache, pollution control cache, etc.

[Figure: a small filter cache sits beside the main cache. First-time misses are filled into the filter cache; if a line is accessed again, it is promoted to the main cache, and if not, it is eventually LRU'd out. The main cache therefore only contains lines with proven reuse; one-time-use lines have been filtered out]

• Can be thought of as the "dual" of the victim cache
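A toy C sketch of the fill/promote policy, modeling both structures as small fully-associative arrays; the sizes, FIFO replacement, and omission of valid bits are assumptions for brevity:

    #include <stdint.h>
    #include <stdbool.h>

    #define FILTER_LINES  8   /* assumed sizes */
    #define MAIN_LINES   64

    static uint64_t filter_c[FILTER_LINES], main_c[MAIN_LINES];
    static unsigned f_next, m_next;   /* FIFO stands in for LRU */

    static bool present(const uint64_t *cache, unsigned n, uint64_t line) {
        for (unsigned i = 0; i < n; i++)
            if (cache[i] == line)
                return true;
        return false;
    }

    /* First-time misses fill the filter cache; a second access promotes
       the line, so the main cache holds only lines with proven reuse. */
    void access_line(uint64_t line) {
        if (present(main_c, MAIN_LINES, line))
            return;                                    /* main-cache hit */
        if (present(filter_c, FILTER_LINES, line))
            main_c[m_next++ % MAIN_LINES] = line;      /* promote on reuse */
        else
            filter_c[f_next++ % FILTER_LINES] = line;  /* first-time miss */
    }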
Slide 29: Trouble w/ Multi-Lateral Caches

• More complexity
  – A load may need to be routed to different places
    • May require some form of prediction to pick the right one; guessing wrong can cause replays
    • Or access multiple structures in parallel, which increases power with no bandwidth benefit
  – More sources to bypass from
    • Costs both latency and power in the bypass network
Slide 30: Memory-Level Parallelism (MLP)

• What if memory latency is 10000 cycles?
  – Not enough traditional ILP to cover this latency
  – Runtime is dominated by waiting for memory
  – What matters is overlapping memory accesses
• MLP: the "number of outstanding cache misses [to main memory] that can be generated and executed in an overlapped manner."
• ILP is a property of a DFG; MLP is a metric
  – ILP is independent of the underlying execution engine
  – MLP depends on the microarchitecture assumptions
  – You can measure MLP for a uniprocessor, CMP, etc.
Slide 31: uArchs for MLP

• WIB: Waiting Instruction Buffer

[Figure: with a conventional scheduler, after a load miss no instructions in its forward slice can execute; eventually all independent instructions issue, and the scheduler contains only the stalled forward slice. With a WIB, the forward slice is moved to a separate buffer, independent instructions continue to issue, new instructions keep the scheduler busy, and other independent load misses are eventually exposed (MLP)]
Slide 32: WIB Hardware

• Similar to replay: dependent instructions continue to issue, but need to be shunted into the WIB
• The WIB hardware can potentially be large
  – But the WIB doesn't do scheduling, so no CAM logic is needed
• Need to redispatch from the WIB back into the RS when the load comes back from memory
  – Like redispatching from a replay queue