Advanced Microarchitecture
Lecture 12: Caches and Memory
Slide 2: SRAM {Over|Re}view

• Chained inverters maintain a stable state
• Access gates provide access to the cell
• Writing to a cell involves over-powering the two small storage inverters

[Figure: the "6T SRAM" cell: two cross-coupled inverters (2T per inverter) holding complementary values (0/1), plus 2 access gates connecting the cell to the bitlines b and b̄]
Slide 3: 64×1-bit SRAM Array Organization

[Figure: an 8×8 SRAM array; a 1-of-8 row decoder drives the "wordlines", and a second 1-of-8 decoder selects among the "bitlines" through the "column mux"]

Why are we reading both b and b̄?
Slide 4: SRAM Density vs. Speed

• 6T cell must be as small as possible to have dense storage
  – Bigger caches
  – Smaller transistors → slower transistors

[Figure: the bitline is a *long* metal line with a lot of parasitic loading, so the dinky cell inverters cannot drive their outputs very quickly…]
Slide 5: Sense Amplifiers

• Type of differential amplifier
  – Two inputs X and Y; amplifies the difference: output = a × (X − Y) + Vbias

[Figure: bitlines b and b̄ are precharged to Vdd; when the wordline is enabled, the small cell discharges its bitline very slowly, but the sense amp "sees" the difference quickly and outputs b's value]

• Sometimes precharge bitlines to Vdd/2, which makes a bigger "delta" for faster sensing
Slide 6: Multi-Porting

[Figure: a dual-ported SRAM cell with two wordlines (Wordline1, Wordline2) and two bitline pairs (b1/b̄1, b2/b̄2)]

• Wordlines = 2 × ports
• Bitlines = 4 × ports
• Area = O(ports²): each added port grows both the cell's height (wordlines) and its width (bitlines), so area grows with their product
Slide 7: Port Requirements

• ARF, PRF, RAT all need many read and write ports to support superscalar execution
  – Luckily, these have a limited number of entries/bytes
• Caches also need multiple ports
  – Not as many ports
  – But the overall size is much larger
Slide 8: Delay Of Regular Caches

• I$
  – Low port requirement (one fetch group/$-line per cycle)
  – Latency only exposed on branch mispredict
• D$
  – Higher port requirement (multiple LD/ST per cycle)
  – Latency often on critical path of execution
• L2
  – Lower port requirement (most accesses hit in L1)
  – Latency less important (only observed on L1 miss)
  – Optimizing for hit rate usually more important than latency
    • The difference between L2 latency and DRAM latency is large
Slide 9: Banking

[Figure: a big 4-ported L1 data cache (four decoders, sense amps, and column muxes around one SRAM array) is slow due to quadratic area growth; splitting it into 4 banks of 1 port each (each bank with its own decoder, SRAM array, and sense amps) makes each bank much faster]
Slide 10: Bank Conflicts

• Banking provides high bandwidth, but only if all accesses are to different banks
• Banks are typically address-interleaved
  – For N banks: Addr → bank[Addr % N], with Addr taken at cache-line granularity
  – For 4 banks and 2 accesses, the chance of a conflict is 25% (assuming independent, uniformly distributed accesses, the second access lands in the first one's bank with probability 1/4)
  – Need to match # banks to access patterns/BW (see the sketch below)
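A minimal C sketch of this interleaving, assuming 64-byte lines and a power-of-two bank count; the names and constants are illustrative, not from the lecture:

    #include <stdint.h>

    #define LINE_BYTES 64u   /* assumed cache-line size */
    #define NUM_BANKS   4u   /* assumed bank count */

    /* Interleave at cache-line granularity: consecutive lines
       map to consecutive banks. */
    static inline unsigned bank_of(uint64_t addr) {
        uint64_t line = addr / LINE_BYTES;
        return (unsigned)(line % NUM_BANKS);
    }

    /* Two simultaneous accesses conflict iff they map to the same bank. */
    static inline int bank_conflict(uint64_t a, uint64_t b) {
        return bank_of(a) == bank_of(b);
    }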
Slide 11: Associativity

• You should know this already

[Figure: looking up foo's value in three organizations: direct mapped (a RAM indexed by foo's address, one possible location), fully associative (a CAM searched with foo, any location), and set associative (a CAM/RAM hybrid: index a set, then search within it)]
Slide 12: Set-Associative Caches

• Set-associativity is good for reducing conflict misses
• Cost: slower cache access
  – Often dominated by the tag array comparisons (basically mini-CAM logic: one 40-50 bit comparison per way!)
• Must trade off:
  – Smaller cache size
  – Longer latency
  – Lower associativity
• Every option hurts performance (a lookup sketch follows)
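To make the per-way comparisons concrete, here is a minimal C sketch of a 4-way tag check; the set count and bit widths are assumptions for illustration:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_WAYS    4
    #define NUM_SETS  128         /* assumed */
    #define LINE_BITS   6         /* assumed 64-byte lines */
    #define SET_BITS    7         /* log2(NUM_SETS) */

    typedef struct { uint64_t tag; bool valid; } tag_entry;

    tag_entry tags[NUM_SETS][NUM_WAYS];

    /* Returns the hitting way, or -1 on a miss.  In hardware all
       NUM_WAYS comparators fire in parallel; the loop models that. */
    int lookup(uint64_t addr) {
        uint64_t set = (addr >> LINE_BITS) & (NUM_SETS - 1);
        uint64_t tag = addr >> (LINE_BITS + SET_BITS);  /* the wide 40-50 bit tag */
        for (int way = 0; way < NUM_WAYS; way++)
            if (tags[set][way].valid && tags[set][way].tag == tag)
                return way;
        return -1;
    }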
Slide 13: Way-Prediction

• If figuring out the way takes too long, then just guess!

[Figure: the load PC indexes a way predictor, and the predicted way's payload is read out speculatively; the tag check (the per-way comparators) still occurs to validate the way-prediction]

• May be hard to predict the way if the same load accesses different addresses (sketch below)
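One possible shape for such a predictor, sketched in C as a generic PC-indexed table; the table size and indexing are assumptions, not the lecture's design:

    #include <stdint.h>

    #define PRED_ENTRIES 1024   /* assumed table size */

    static uint8_t way_pred[PRED_ENTRIES];   /* last way that hit, per load PC */

    static inline unsigned pred_index(uint64_t load_pc) {
        return (unsigned)((load_pc >> 2) % PRED_ENTRIES);
    }

    /* Guess the way before the tag comparison completes... */
    unsigned predict_way(uint64_t load_pc) {
        return way_pred[pred_index(load_pc)];
    }

    /* ...and train on the way the (slower) tag check actually found. */
    void train_way(uint64_t load_pc, unsigned actual_way) {
        way_pred[pred_index(load_pc)] = (uint8_t)actual_way;
    }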
Slide 14: Way-Prediction (2)

• Organize the data array s.t. the left-most way is the MRU

[Figure: blocks ordered MRU→LRU across the ways; way-predicting the MRU way keeps hitting, and on a way-miss (still a cache hit) the block is moved to the MRU position so way-prediction continues to hit]

• Complication: the data array needs a datapath for swapping blocks (maybe 100's of bits); normally you just update a few LRU bits in the tag array (< 10 bits?)
Slide 15: Partial Tagging

• Like BTBs, just use part of the tag
  – Tag array lookup is now much faster!
• Partial tags lead to false hits: tag 0x45120001 looks like a hit for address 0x3B120001
• Similar to way-prediction, a full tag comparison is still needed to verify a "real" hit, but it is not on the critical path (sketch below)
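A minimal sketch of the two-step check in C, assuming (illustratively) that only the low 16 bits of the tag are compared on the fast path:

    #include <stdint.h>
    #include <stdbool.h>

    #define PARTIAL_MASK 0xFFFFu   /* assumed partial-tag width: 16 bits */

    /* Fast path: narrow comparison; may report false hits. */
    static inline bool partial_match(uint64_t stored_tag, uint64_t addr_tag) {
        return (stored_tag & PARTIAL_MASK) == (addr_tag & PARTIAL_MASK);
    }

    /* Slow path, off the critical path: the full comparison catches
       false hits such as 0x45120001 vs. 0x3B120001, which agree in
       the low bits but differ in the full tag. */
    static inline bool full_match(uint64_t stored_tag, uint64_t addr_tag) {
        return stored_tag == addr_tag;
    }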
Slide 16: … in the LSQ

• Partial tagging can be used in the LSQ as well
  – Do the address check on partial addresses only
  – On a partial hit, forward the data; a slower complete tag check verifies the match/no-match, and we replay or flush as needed
• If a store finds a later partially-matched load, don't do a pipeline flush right away
  – The penalty is too severe; wait for the slow check before flushing the pipe
Slide 17: Interaction With Scheduling

• Bank conflicts, way-mispredictions, and partial-tag false hits all change the latency of the load instruction
  – Increases the frequency of replays (more "replay conditions" exist/are encountered)
  – Need a careful tradeoff between:
    • performance (reducing effective cache latency)
    • performance (frequency of replaying instructions)
    • power (frequency of replaying instructions)
Slide 18: Alternatives to Adding Associativity

• More set-associativity is needed when the number of items mapping to the same cache set > the number of ways
• Not all sets suffer from high conflict rates
• Idea: provide a little extra associativity, but not for each and every set
Slide 19: Victim Cache

[Figure: access streams ABCDE and JKLMN hitting a 4-way set-associative cache: every access is a miss, because five blocks do not "fit" in a 4-way set (E evicts A, then A evicts B, and so on). A small victim cache beside the main cache catches each evicted block, providing a "fifth way" so long as only four sets overflow into it at the same time; it can even provide 6th or 7th … ways. The figure also shows a third stream, PQR, competing for the same victim-cache capacity]
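A tiny C sketch of the mechanism, modeling the fully-associative victim buffer as a small array; the size, FIFO replacement, and omission of valid bits are simplifying assumptions:

    #include <stdint.h>
    #include <stdbool.h>

    #define VICTIM_LINES 4   /* assumed victim-cache size */

    static uint64_t victim[VICTIM_LINES];   /* evicted line addresses */
    static unsigned v_next;                 /* FIFO replacement, for brevity */

    /* A line evicted from a conflicting L1 set parks here. */
    void on_l1_evict(uint64_t line) {
        victim[v_next++ % VICTIM_LINES] = line;
    }

    /* Probed on an L1 miss; a hit means the "extra way" saved us and
       the line can be swapped back into L1 instead of refetched. */
    bool victim_hit(uint64_t line) {
        for (unsigned i = 0; i < VICTIM_LINES; i++)
            if (victim[i] == line)
                return true;
        return false;
    }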
Slide 20: Skewed Associativity

[Figure: blocks ABCD and WXYZ all map to the same sets of a regular set-associative cache, causing lots of misses; in a skewed-associative cache, each way indexes the array with a different hash of the address, so the same blocks (A/X, Y/C, B/D, W/Z) spread across different sets and there are fewer misses]
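A sketch of per-way skewed indexing in C; the XOR-based hash below is an illustrative stand-in (skewed caches in the literature use XOR-style mappings), not the lecture's exact function:

    #include <stdint.h>

    #define SET_BITS 7                    /* assumed: 128 sets per way */
    #define NUM_SETS (1u << SET_BITS)

    /* Each way hashes the line address differently, so two lines that
       conflict in one way usually land in different sets of another. */
    static inline unsigned skew_index(uint64_t line_addr, unsigned way) {
        uint64_t h = line_addr ^ (line_addr >> (SET_BITS * (way + 1)));
        return (unsigned)(h & (NUM_SETS - 1));
    }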
Slide 21: Required Associativity Varies

• The program stack needs very little associativity
  – Spatial locality:
    • the stack frame is laid out sequentially
    • a function usually only refers to its own stack frame

[Figure: a call stack of frames f(), g(), h(), j(), k() with addresses laid out in a linear organization; in a 4-way cache the frames simply fill the ways MRU→LRU, so the associativity is not being used effectively]
Slide 22: Stack Cache

[Figure: the "nice" stack accesses (frames f(), g(), h(), j(), k()) are routed to a separate stack cache, while the disorganized heap accesses go to the "regular" cache; mixing both in one cache causes lots of conflicts]
Slide 23: Stack Cache (2)

• The stack-cache portion can be a lot simpler due to its direct-mapped structure
  – Relatively easy to prefetch for by monitoring call/retn's
• The "regular" cache portion can have lower associativity
  – It doesn't have conflicts due to stack/heap interaction
Slide 24: Stack Cache (3)

• Which cache does a load access?
  – Many ISAs have a "default" stack-pointer register

[Figure: LDQ 0[$sp], LDQ 12[$sp], and LDQ 24[$sp] route to the stack cache, while LDQ 0[$t1] routes to the regular cache. But after MOV $t3 = $sp, LDQ 8[$t3] is really a stack access that the $sp heuristic misroutes: you need stack base and offset information and must check each cache access against these bounds, because a wrong-cache access → replay (a bounds-check sketch follows)]
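A minimal C sketch of such a routing check, assuming a downward-growing stack and hypothetical stack_limit/stack_base bounds registers:

    #include <stdint.h>
    #include <stdbool.h>

    /* Route a load by comparing its effective address against the
       current stack bounds (stack occupies [stack_limit, stack_base)). */
    static inline bool routes_to_stack_cache(uint64_t ea,
                                             uint64_t stack_limit,
                                             uint64_t stack_base) {
        return ea >= stack_limit && ea < stack_base;
    }

An address-based check like this catches stack accesses made through a copied pointer (like $t3 above) that a register-name heuristic would misroute.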
Slide 25: Multi-Lateral Caches

• A normal cache is "uni-lateral" in that everything goes into the same place
• The stack cache is an example of a "multi-lateral" cache
  – Multiple cache structures with disjoint contents
  – I$ vs. D$ could be considered multi-lateral
Slide 26: Access Patterns

• The stack cache showed how different loads exhibit different access patterns:
  – Stack: multiple push/pop's of frames
  – Heap: heavily data-dependent access patterns
  – Streaming: linear accesses with low/no reuse
Slide 27: Low-Reuse Accesses

• Streaming
  – Once you're done decoding an MPEG frame, there is no need to revisit it
• Other, e.g.:

    struct tree_t {
        int valid;
        int other_fields[24];   /* fields map to different cache lines */
        int num_children;
        struct tree_t * children;
    };

    while (some condition) {
        struct tree_t * parent = getNextRoot(…);
        if (parent->valid) {
            doTreeTraversalStuff(parent);
            doMoreStuffToTree(parent);
            pickFruitFromTree(parent);
        }
    }

  – parent->valid is accessed once, and then not used again
Slide 28: Filter Caches

• Several proposed variations
  – Annex cache, pollution control cache, etc.

[Figure: a small filter cache sits beside the main cache. First-time misses are filled into the filter cache; if a line is accessed again, it is promoted to the main cache, and if not, it is eventually LRU'd out. The main cache therefore only contains lines with proven reuse; one-time-use lines have been filtered out]

• Can be thought of as the "dual" of the victim cache
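A toy C sketch of the fill/promote policy, modeling both structures as small fully-associative arrays; the sizes, FIFO replacement, and omission of valid bits are assumptions for brevity:

    #include <stdint.h>
    #include <stdbool.h>

    #define FILTER_LINES  8   /* assumed sizes */
    #define MAIN_LINES   64

    static uint64_t filter_c[FILTER_LINES], main_c[MAIN_LINES];
    static unsigned f_next, m_next;   /* FIFO stands in for LRU */

    static bool present(const uint64_t *cache, unsigned n, uint64_t line) {
        for (unsigned i = 0; i < n; i++)
            if (cache[i] == line)
                return true;
        return false;
    }

    /* First-time misses fill the filter cache; a second access promotes
       the line, so the main cache holds only lines with proven reuse. */
    void access_line(uint64_t line) {
        if (present(main_c, MAIN_LINES, line))
            return;                                    /* main-cache hit */
        if (present(filter_c, FILTER_LINES, line))
            main_c[m_next++ % MAIN_LINES] = line;      /* promote on reuse */
        else
            filter_c[f_next++ % FILTER_LINES] = line;  /* first-time miss */
    }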
Slide 29: Trouble w/ Multi-Lateral Caches

• More complexity
  – A load may need to be routed to different places
    • May require some form of prediction to pick the right one; guessing wrong can cause replays
    • Or access multiple structures in parallel, which increases power with no bandwidth benefit
  – More sources to bypass from
    • Costs both latency and power in the bypass network
Slide 30: Memory-Level Parallelism (MLP)

• What if memory latency is 10000 cycles?
  – Not enough traditional ILP to cover this latency
  – Runtime is dominated by waiting for memory
  – What matters is overlapping memory accesses
• MLP: the "number of outstanding cache misses [to main memory] that can be generated and executed in an overlapped manner."
• ILP is a property of a DFG; MLP is a metric
  – ILP is independent of the underlying execution engine
  – MLP depends on the microarchitecture assumptions
  – You can measure MLP for a uniprocessor, CMP, etc.
Slide 31: uArchs for MLP

• WIB: Waiting Instruction Buffer

[Figure: with a conventional scheduler, after a load miss no instructions in its forward slice can execute; eventually all independent instructions issue, and the scheduler contains only the stalled forward slice. With a WIB, the forward slice is moved to a separate buffer, independent instructions continue to issue, new instructions keep the scheduler busy, and other independent load misses are eventually exposed (MLP)]
Slide 32: WIB Hardware

• Similar to replay: dependent instructions continue to issue, but need to be shunted into the WIB
• The WIB hardware can potentially be large
  – But the WIB doesn't do scheduling, so no CAM logic is needed
• Need to redispatch from the WIB back into the RS when the load comes back from memory
  – Like redispatching from a replay queue