Advanced Microarchitecture Lecture 12: Caches and Memory


Page 1: Advanced  Microarchitecture

Advanced Microarchitecture
Lecture 12: Caches and Memory

Page 2: Advanced  Microarchitecture

SRAM {Over|Re}view

• Chained inverters maintain a stable state
• Access gates provide access to the cell
• Writing to a cell involves over-powering the two small storage inverters

[Figure: "6T SRAM" cell: two cross-coupled inverters (2T per inverter) holding complementary values, plus 2 access gates connecting the cell to bitlines b and b̄]

Page 3: Advanced  Microarchitecture

64×1-bit SRAM Array Organization

[Figure: a 1-of-8 row decoder drives the "wordlines"; the cells drive the "bitlines", which feed a "column mux" controlled by a second 1-of-8 decoder. Why are we reading both b and b̄?]

Page 4: Advanced  Microarchitecture

SRAM Density vs. Speed

• 6T cell must be as small as possible to have dense storage
  – Bigger caches
  – Smaller transistors → slower transistors

[Figure: the bitline is a *long* metal line with a lot of parasitic loading, so the dinky inverters cannot drive their outputs very quickly…]

Page 5: Advanced  Microarchitecture

Sense Amplifiers

• Type of differential amplifier
  – Two inputs X and Y; amplifies the difference: out = a × (X − Y) + Vbias

[Figure: a diff-amp senses bitlines b and b̄. The bitlines are precharged to Vdd; when the wordline is enabled, the small cell discharges its bitline very slowly, but the sense amp "sees" the difference quickly and outputs b's value. Sometimes the bitlines are precharged to Vdd/2 instead, which makes a bigger "delta" for faster sensing.]

Page 6: Advanced  Microarchitecture

Multi-Porting

[Figure: dual-ported 6T cell with two wordlines (Wordline1, Wordline2) and two differential bitline pairs (b1/b̄1, b2/b̄2)]

• Wordlines and bitlines both scale with the number of ports (2 wordlines, 4 bitlines for the 2-ported cell shown)
• Area = O(ports²), since the cell grows in both dimensions

Page 7: Advanced  Microarchitecture

Port Requirements

• ARF, PRF, RAT all need many read and write ports to support superscalar execution
  – Luckily, these have a limited number of entries/bytes
• Caches also need multiple ports
  – Not as many ports
  – But the overall size is much larger

Page 8: Advanced  Microarchitecture

Delay Of Regular Caches

• I$
  – low port requirement (one fetch group/$-line per cycle)
  – latency only exposed on branch mispredict
• D$
  – higher port requirement (multiple LD/ST per cycle)
  – latency often on critical path of execution
• L2
  – lower port requirement (most accesses hit in L1)
  – latency less important (only observed on L1 miss)
  – optimizing for hit rate usually more important than latency
    • difference between L2 latency and DRAM latency is large

Page 9: Advanced  Microarchitecture

Banking

[Figure: a big 4-ported L1 data cache (four decoders, one SRAM array, four sets of sense amps and column muxing) is slow due to quadratic area growth. Splitting it into 4 banks, 1 port each (each bank with its own decoder, SRAM array, and sense amps) makes each bank much faster.]

Page 10: Advanced  Microarchitecture

Bank Conflicts

• Banking provides high bandwidth
• But only if all accesses are to different banks
• Banks typically address-interleaved
  – For N banks: Addr → bank[Addr % N]
    • Addr at cache-line granularity
  – For 4 banks and 2 accesses, the chance of conflict is 25%
  – Need to match # banks to access patterns/BW (see the sketch below)
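
A minimal C sketch of this interleaving (the 4-bank count and 64-byte line size are illustrative assumptions, not values from the lecture):

#include <stdint.h>
#include <stdio.h>

#define N_BANKS    4    /* assumed bank count */
#define LINE_BYTES 64   /* assumed cache-line size */

/* Bank selection at cache-line granularity: drop the line offset,
 * then take the line address modulo the number of banks. */
unsigned bank_of(uint64_t addr) {
    return (unsigned)((addr / LINE_BYTES) % N_BANKS);
}

int main(void) {
    /* Two simultaneous accesses conflict iff they pick the same bank.
     * For 2 independent, uniformly distributed accesses and 4 banks,
     * P(conflict) = 1/4, i.e. the 25% figure above. */
    uint64_t a = 0x1040, b = 0x20C0;
    printf("bank(a)=%u bank(b)=%u -> %s\n", bank_of(a), bank_of(b),
           bank_of(a) == bank_of(b) ? "conflict" : "no conflict");
    return 0;
}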

Page 11: Advanced  Microarchitecture

Associativity

• You should know this already

[Figure: direct mapped: foo indexes a RAM directly to read foo's value. Fully associative: foo is matched against every entry in a CAM. Set associative: a CAM/RAM hybrid? foo indexes a set, then a match within the set selects foo's value.]

Page 12: Advanced  Microarchitecture

Set-Associative Caches

• Set-associativity good for reducing conflict misses
• Cost: slower cache access
  – often dominated by the tag array comparisons (a 40-50 bit comparison per way! sketched below)
  – basically mini-CAM logic
• Must trade off:
  – Smaller cache size
  – Longer latency
  – Lower associativity
• Every option hurts performance
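
A minimal C sketch of the lookup path (set count, associativity, and line size are assumed for illustration; in hardware the per-way tag compares happen in parallel, which is the mini-CAM logic):

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64   /* assumed */
#define N_SETS     64   /* assumed */
#define N_WAYS      4   /* assumed 4-way set-associative */

struct tag_entry { uint64_t tag; bool valid; };
struct tag_entry tags[N_SETS][N_WAYS];

/* Index bits select a set; the remaining high-order address bits form
 * the tag, which is why each comparator can be 40-50 bits wide for
 * large physical address spaces. */
int lookup(uint64_t addr) {
    uint64_t line = addr / LINE_BYTES;
    unsigned set  = (unsigned)(line % N_SETS);
    uint64_t tag  = line / N_SETS;
    for (int w = 0; w < N_WAYS; w++)        /* parallel in hardware */
        if (tags[set][w].valid && tags[set][w].tag == tag)
            return w;                       /* hit: return way number */
    return -1;                              /* miss */
}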

Page 13: Advanced  Microarchitecture

Way-Prediction

• If figuring out the way takes too long, then just guess!

[Figure: the load PC indexes a way predictor ("WayPred") whose output selects which way's payload to read immediately; the tag check still occurs to validate the way-pred.]

• May be hard to predict the way if the same load accesses different addresses (a minimal predictor sketch follows)
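
A minimal sketch of a PC-indexed way predictor (the table size and indexing are illustrative assumptions; the parallel full-tag check that validates the guess is elided):

#include <stdint.h>

#define PRED_ENTRIES 256   /* assumed predictor size */

uint8_t way_pred[PRED_ENTRIES];

/* The predicted way's data is read immediately; if the full tag check
 * later disagrees, the load (and its dependents) must replay. */
unsigned predict_way(uint64_t load_pc) {
    return way_pred[(load_pc >> 2) % PRED_ENTRIES];
}

/* Train with the way the tag check actually hit in. */
void train_way(uint64_t load_pc, unsigned actual_way) {
    way_pred[(load_pc >> 2) % PRED_ENTRIES] = (uint8_t)actual_way;
}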

Page 14: Advanced  Microarchitecture

Way-Prediction (2)

• Organize the data array s.t. the left-most way is the MRU

[Figure: accesses hit a set ordered MRU → LRU. Way-predict the MRU way: way-prediction keeps hitting. On a way-miss (still a cache hit), move the block to the MRU position, and way-prediction continues to hit.]

• Complication: the data array needs a datapath for swapping blocks (maybe 100's of bits)
  – Normally you'd just update a few LRU bits in the tag array (< 10 bits?)

Page 15: Advanced  Microarchitecture

Partial Tagging

• Like BTBs, just use part of the tag
  – Tag array lookup now much faster!
• Partial tags lead to false hits: tag 0x45120001 looks like a hit for address 0x3B120001, since the low bits match (see the example below)
• Similar to way-prediction, the full tag comparison is still needed to verify a "real" hit --- but it's not on the critical path
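
A small self-contained example of the false hit above, assuming a 24-bit partial tag (the width is an illustrative assumption):

#include <stdint.h>
#include <stdio.h>

#define PARTIAL_BITS 24
#define PARTIAL_MASK ((1u << PARTIAL_BITS) - 1)

int main(void) {
    uint32_t stored = 0x45120001;   /* tag held in the tag array */
    uint32_t probe  = 0x3B120001;   /* tag of the incoming address */

    /* Fast path: compare only the low PARTIAL_BITS. */
    int partial_hit = (stored & PARTIAL_MASK) == (probe & PARTIAL_MASK);
    /* Slow path, off the critical path: compare the full tags. */
    int real_hit = (stored == probe);

    /* Prints "partial hit: 1, real hit: 0"; a false hit that the
     * full comparison must catch. */
    printf("partial hit: %d, real hit: %d\n", partial_hit, real_hit);
    return 0;
}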

Page 16: Advanced  Microarchitecture

… in the LSQ

• Partial tagging can be used in the LSQ as well
  – Do the address check on partial addresses only
  – On a partial hit, forward the data
  – The slower complete tag check verifies the match/no-match
  – Replay or flush as needed
• If a store finds a later partially-matched load, don't do the pipeline flush right away
  – The penalty is too severe; wait for the slow check before flushing the pipe

Page 17: Advanced  Microarchitecture

Interaction With Scheduling

• Bank conflicts, way-mispredictions, partial-tag false hits
  – All change the latency of the load instruction
  – Increases the frequency of replays
    • more "replay conditions" exist/are encountered
  – Need careful tradeoff between
    • performance (reducing effective cache latency)
    • performance (frequency of replaying instructions)
    • power (frequency of replaying instructions)

Page 18: Advanced  Microarchitecture

Alternatives to Adding Associativity

• More set-associativity is needed when the number of items mapping to the same cache set exceeds the number of ways
• Not all sets suffer from high conflict rates
• Idea: provide a little extra associativity, but not for each and every set

Page 19: Advanced  Microarchitecture

Victim Cache

[Figure: two reference streams, ABCDE and JKLMN, each cycle through one 4-way set; evicted blocks (E, N, …) move into a small fully-associative victim cache and are swapped back in on a victim-cache hit.]

• Every access is a miss! ABCDE and JKLMN do not "fit" in a 4-way set-associative cache
• The victim cache provides a "fifth way", so long as only four sets overflow into it at the same time
• Can even provide 6th or 7th … ways (a sketch of the miss path follows)
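
A minimal sketch of the miss path, assuming a 4-entry fully-associative victim cache with FIFO replacement (sizes and policy are illustrative; the swap back into the main cache is elided):

#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4   /* assumed; matches the "four sets" capacity above */

struct vc_entry { uint64_t tag; bool valid; };
struct vc_entry victim[VC_ENTRIES];
unsigned vc_next;      /* trivial FIFO replacement pointer (assumed) */

/* On a main-cache miss, search the victim cache before going to the
 * next level; a hit here acts as the "fifth way". */
bool victim_lookup(uint64_t tag) {
    for (int i = 0; i < VC_ENTRIES; i++)
        if (victim[i].valid && victim[i].tag == tag)
            return true;
    return false;
}

/* Blocks evicted from the main cache are inserted here instead of
 * being dropped. */
void victim_insert(uint64_t tag) {
    victim[vc_next] = (struct vc_entry){ .tag = tag, .valid = true };
    vc_next = (vc_next + 1) % VC_ENTRIES;
}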

Page 20: Advanced  Microarchitecture

Skewed Associativity

[Figure: two groups of blocks, ABCD and WXYZ, all map to the same set of a regular set-associative cache, causing lots of misses. In a skewed-associative cache, each way uses a different index hash, so the same blocks spread across sets (A/X, Y/C, B/D, W/Z): far fewer misses.]
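
A minimal sketch of per-way index hashing (the XOR-based second hash is an illustrative assumption, not the function from any published skewed-associative design):

#include <stdint.h>

#define N_SETS 64   /* assumed sets per way */

/* Way 0 indexes conventionally with the low line-address bits. */
unsigned index_way0(uint64_t line_addr) {
    return (unsigned)(line_addr % N_SETS);
}

/* Way 1 folds higher address bits into the index, so two blocks that
 * collide in way 0 usually map to different sets in way 1. */
unsigned index_way1(uint64_t line_addr) {
    return (unsigned)((line_addr ^ (line_addr / N_SETS)) % N_SETS);
}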

Page 21: Advanced  Microarchitecture

Required Associativity Varies

• Program stack needs very little associativity
  – spatial locality
    • stack frame is laid out sequentially
    • function usually only refers to its own stack frame

[Figure: call stack f(), g(), h(), j(), k(); frame addresses are laid out in a linear organization, so in a 4-way cache each set holds only one live stack block in its MRU way while the LRU ways sit idle: associativity is not being used effectively.]

Page 22: Advanced  Microarchitecture

Stack Cache

[Figure: "nice" stack accesses from f() through k() are routed to a dedicated stack cache, while the disorganized heap accesses, which cause lots of conflicts, go to the "regular" cache.]

Page 23: Advanced  Microarchitecture

Stack Cache (2)

• Stack cache portion can be a lot simpler due to its direct-mapped structure
  – relatively easily prefetched for by monitoring call/retn's
• "Regular" cache portion can have lower associativity
  – doesn't have conflicts due to stack/heap interaction

Page 24: Advanced  Microarchitecture

Stack Cache (3)

• Which cache does a load access?
  – Many ISA's have a "default" stack-pointer register

[Figure: LDQ 0[$sp], LDQ 12[$sp], and LDQ 24[$sp] route to the stack cache; LDQ 8[$t3] and LDQ 0[$t1] route to the regular cache. But after MOV $t3 = $sp, the $t3 access actually targets the stack.]

• Need stack base and offset information, and then need to check each cache access against these bounds
• Wrong cache → replay (a routing sketch follows)
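
A minimal sketch of the predict-then-verify routing (the bounds variables and helper names are hypothetical):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bounds describing the current stack region. */
uint64_t stack_base, stack_limit;

/* Fast guess, available early: a load whose base register is $sp is
 * routed to the stack cache. */
bool predict_stack_cache(bool base_reg_is_sp) {
    return base_reg_is_sp;
}

/* Slow check, once the effective address is known: a copy of $sp
 * (MOV $t3 = $sp) also points into the stack, so the bounds decide.
 * Disagreement with the prediction means a wrong-cache replay. */
bool verify_stack_cache(uint64_t effective_addr) {
    return effective_addr >= stack_base && effective_addr < stack_limit;
}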

Page 25: Advanced  Microarchitecture

Multi-Lateral Caches

• A normal cache is "uni-lateral" in that everything goes into the same place
• The stack cache is an example of "multi-lateral" caches
  – multiple cache structures with disjoint contents
  – I$ vs. D$ could be considered multi-lateral

Page 26: Advanced  Microarchitecture

Access Patterns

• Stack cache showed how different loads exhibit different access patterns
  – Stack: multiple push/pop's of frames
  – Heap: heavily data-dependent access patterns
  – Streaming: linear accesses with low/no reuse

Page 27: Advanced  Microarchitecture

Low-Reuse Accesses

• Streaming
  – once you're done decoding an MPEG frame, no need to revisit it
• Other: fields map to different cache lines

struct tree_t {
    int valid;
    int other_fields[24];
    int num_children;
    struct tree_t *children;
};

while (some condition) {
    struct tree_t *parent = getNextRoot(…);
    if (parent->valid) {
        doTreeTraversalStuff(parent);
        doMoreStuffToTree(parent);
        pickFruitFromTree(parent);
    }
}

• parent->valid is accessed once, and then not used again
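
The slide's own struct makes the point concrete; this small program prints the field offsets (the stated line assignments assume 4-byte ints, 8-byte pointers, and 64-byte cache lines):

#include <stddef.h>
#include <stdio.h>

struct tree_t {
    int valid;
    int other_fields[24];
    int num_children;
    struct tree_t *children;
};

int main(void) {
    /* On a typical LP64 target: valid @ 0 (first cache line),
     * num_children @ 100 and children @ 104 (second cache line).
     * The one-time read of parent->valid therefore drags in a line
     * that is never touched again. */
    printf("valid @ %zu, num_children @ %zu, children @ %zu, size = %zu\n",
           offsetof(struct tree_t, valid),
           offsetof(struct tree_t, num_children),
           offsetof(struct tree_t, children),
           sizeof(struct tree_t));
    return 0;
}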

Page 28: Advanced  Microarchitecture

Filter Caches

• Several proposed variations
  – annex cache, pollution control cache, etc.

[Figure: fill on miss goes to a small filter cache in front of the main cache. First-time misses are placed in the filter cache; if accessed again, the line is promoted to the main cache; if not accessed again, it is eventually LRU'd out.]

• Main cache only contains lines with proven reuse; one-time-use lines have been filtered out
• Can be thought of as the "dual" of the victim cache (see the sketch below)
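
A minimal sketch of the fill/promote policy (structure sizes and the direct-mapped placement are illustrative assumptions):

#include <stdbool.h>
#include <stdint.h>

#define FILTER_ENTRIES  8   /* assumed small filter cache */
#define MAIN_ENTRIES   64   /* assumed main cache, one tag per entry */

uint64_t filter_tags[FILTER_ENTRIES], main_tags[MAIN_ENTRIES];
bool     filter_valid[FILTER_ENTRIES], main_valid[MAIN_ENTRIES];

static bool find(const uint64_t *tags, const bool *valid, int n, uint64_t tag) {
    for (int i = 0; i < n; i++)
        if (valid[i] && tags[i] == tag)
            return true;
    return false;
}

/* First-time misses fill only the filter cache; a second touch proves
 * reuse and promotes the line to the main cache.  One-time-use lines
 * age out of the filter without ever polluting the main cache. */
void access_line(uint64_t tag) {
    if (find(main_tags, main_valid, MAIN_ENTRIES, tag))
        return;                                          /* main-cache hit */
    if (find(filter_tags, filter_valid, FILTER_ENTRIES, tag)) {
        unsigned i = (unsigned)(tag % MAIN_ENTRIES);     /* promote */
        main_tags[i] = tag; main_valid[i] = true;
    } else {
        unsigned i = (unsigned)(tag % FILTER_ENTRIES);   /* first touch */
        filter_tags[i] = tag; filter_valid[i] = true;
    }
}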

Page 29: Advanced  Microarchitecture

Trouble w/ Multi-Lateral Caches

• More complexity
  – a load may need to be routed to different places
    • may require some form of prediction to pick the right one
      – guessing wrong can cause replays
    • or accessing multiple in parallel increases power
      – no bandwidth benefit
  – more sources to bypass from
    • costs both latency and power in the bypass network

Page 30: Advanced  Microarchitecture

Memory-Level Parallelism (MLP)

• What if memory latency is 10000 cycles?
  – Not enough traditional ILP to cover this latency
  – Runtime dominated by waiting for memory
  – What matters is overlapping memory accesses
• MLP: "number of outstanding cache misses [to main memory] that can be generated and executed in an overlapped manner."
• ILP is a property of a DFG; MLP is a metric
  – ILP is independent of the underlying execution engine
  – MLP is dependent on the microarchitecture assumptions
  – You can measure MLP for a uniprocessor, CMP, etc. (contrast the two loops below)
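
Two illustrative loops (not from the lecture) make the contrast concrete: a dependent pointer chase serializes its misses regardless of window size, while independent loads can miss in an overlapped manner:

#include <stddef.h>

/* MLP ~ 1: each load's address depends on the previous load's value,
 * so the misses form a serial chain. */
long chase(const long *next, size_t hops) {
    long i = 0;
    for (size_t h = 0; h < hops; h++)
        i = next[i];            /* must wait for this miss to continue */
    return i;
}

/* High MLP: the loads are independent, so an out-of-order core can
 * have many of these misses outstanding at once. */
long sum_strided(const long *a, size_t n, size_t stride) {
    long s = 0;
    for (size_t i = 0; i < n; i += stride)
        s += a[i];              /* misses can overlap */
    return s;
}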

Page 31: Advanced  Microarchitecture

uArchs for MLP

• WIB – Waiting Instruction Buffer

[Figure: with just a scheduler, a load miss means no instructions in its forward slice can execute; eventually all independent insts issue and the scheduler contains only insts in the forward slice… stalled. With a WIB, the forward slice of the load miss is moved to a separate buffer, independent insts continue to issue, new insts keep the scheduler busy, and other independent load misses are eventually exposed (MLP).]

Page 32: Advanced  Microarchitecture

WIB Hardware

• Similar to replay – continue issuing dependent instructions, but they need to be shunted to the WIB
• WIB hardware can potentially be large
  – the WIB doesn't do scheduling, so no CAM logic is needed
• Need to redispatch from the WIB back into the RS when the load comes back from memory
  – like redispatching from a replay-queue