Caching Considerations for Generational Garbage Collection
Presented by: Felix Gartsman – 306054172
http://www.cs.tau.ac.il/~gartsma/seminar.ppt
Introduction
Main theme: Effect of memory caches on GC performance
What is a memory cache?
How do caches work?
How and why do caches and GC interact?
Can we boost GC performance by knowing more about caches?
Motivation
CPU and memory performance do not advance at the same speed
When the CPU waits for memory, it is idle
Solutions: pipelining, speculative execution, and caches
Caches provide fast access to commonly accessed memory
Caches and GC
Two-way relationship:
Improving GC performance through "cache awareness" – minimizing cache misses
GC improving the mutator's memory access locality and minimizing the mutator's cache misses (not dealt with in the article)
Previous Work (Outdated!)
Deals mainly with interaction with virtual memory systems
No special attention to generational GC
Assumed "best/worst cases" and special hardware
Investigated only direct-mapped caches
Article Contribution
Surveys GGC performance on various caches
Examines techniques for improving performance
Main advice: try to keep the youngest generation fully in cache; if that is impossible, prefer associative caches
Roadmap
Cache in-depth
GC memory reuse cycle
GGC as a better GC
Comparing cache size requirements
Comparing misses for different cache types
Conclusions
Cache in-depth
Memory Hierarchy: Registers → Cache → Main Memory → Virtual Memory (Disk)
Cache Hierarchy: L1 → L2 → L3 (?)
A higher level means higher speed and smaller capacity
A miss at one level relays the handling to the next lower level
Motivation contd.
When a memory word is not in the cache, a "cache miss" occurs
A cache miss stalls the CPU and forces an access to main memory
Cache misses are expensive, and they become more expensive with each new generation of CPUs
Memory access penalty on the P4: L1 – 2 cycles, L2 – 7, miss – dozens, depending on the memory type
Cache properties
Size (8-64 KB in L1, 128 KB-3 MB in L2, 6-8 MB in L3?)
Layout (block size and sub-blocks)
Placement (N:M hash function)
Associativity
Write strategy: write-through or write-back; fetch-on-write or write-around
Cache Size
Size – the bigger the better. A cache that is too small can render a fast CPU sluggish (the Intel Celeron, for example)
A bigger cache reduces cache misses
Constraints:
Physical feasibility (proximity, size, heat)
Money (cost vs. performance ratio)
Cache Layout
Cache memory is divided into blocks called "cache lines"
Each line contains a validity bit, a dirty bit, replacement policy bits, an address tag and, of course, the data
Bigger blocks reduce misses when spatial locality is good, but hurt performance when working on multiple memory regions – and lines take longer to fill
Cache Layout contd.
This can be solved by dividing lines into sub-blocks and managing them separately
Cache Placement
Maps a memory address to a block number
Examples:
Address modulo #blocks
Select the middle bits of the address
Select a set of bits
Must be fast and "hardware friendly"
Should be a uniform mapping (a sketch follows)
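A minimal sketch of the middle-bits placement scheme, assuming a direct-mapped cache with hypothetical sizes (16-byte lines, 16 KB capacity):

  #include <stdint.h>

  #define LINE_SIZE 16            /* bytes per cache line     */
  #define NUM_LINES 1024          /* 16 KB cache = 1024 lines */

  /* The low bits pick the byte within a line, the middle bits pick
     the line, and the remaining high bits form the address tag. */
  static uint32_t cache_index(uint32_t addr) {
      return (addr / LINE_SIZE) % NUM_LINES;
  }

  static uint32_t cache_tag(uint32_t addr) {
      return addr / (LINE_SIZE * NUM_LINES);
  }

Because the sizes are powers of two, the division and modulo reduce to shifts and masks – which is exactly what makes the mapping fast and "hardware friendly".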
Cache Associativity
Fully associative – each address can be in any block. All tags must be checked – slow or expensive. LRU replacement
Direct mapped – an address can be in only one block. Fast lookup, but no usage history
Set associative – each address can be in a set (2, 4, 8) of blocks. A compromise – fast access and limited usage history (lookup sketch below)
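A sketch of a set-associative lookup with LRU stamps; the sizes and the 4-way choice are illustrative, not from the article:

  #include <stdint.h>

  #define LINE_SIZE 16
  #define NUM_LINES 1024
  #define WAYS      4                   /* 4-way set associative */
  #define SETS      (NUM_LINES / WAYS)

  struct line { int valid; uint32_t tag; unsigned stamp; };
  static struct line cache[SETS][WAYS];

  /* Returns 1 on a hit. On a miss, the least recently used way in
     the set is evicted and refilled (data movement omitted). */
  static int lookup(uint32_t addr, unsigned now) {
      uint32_t set = (addr / LINE_SIZE) % SETS;
      uint32_t tag = addr / (LINE_SIZE * SETS);
      int victim = 0;
      for (int w = 0; w < WAYS; w++) {
          if (cache[set][w].valid && cache[set][w].tag == tag) {
              cache[set][w].stamp = now;   /* refresh usage history */
              return 1;
          }
          if (cache[set][w].stamp < cache[set][victim].stamp)
              victim = w;                  /* track the LRU way */
      }
      cache[set][victim] = (struct line){ 1, tag, now };
      return 0;
  }

Setting WAYS = 1 gives a direct-mapped cache; WAYS = NUM_LINES gives a fully associative one.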
Cache Write Strategy
Write-Through – write directly to memory and, of course, update the cache (slow, but can use write buffers)
Write-Back – write to the cache and mark the line dirty; flush to memory later. Very useful for multiple writes to nearby addresses (object initialization). Can also benefit from write buffers (less useful)
Cache Write Strategy contd.
What to do on a write cache miss?
Fetch-on-write/Write-allocate – on a miss, fetch the corresponding cache line and treat the write as a hit
Write-around/Write-no-allocate – write directly to memory
Usual pairings: write-back + write-allocate, write-through + write-no-allocate (see the sketch below)
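A sketch of the usual write-back + write-allocate pairing on a direct-mapped cache; sizes are hypothetical and the memory-traffic routines are stubs:

  #include <stdint.h>

  #define LINE_SIZE 16
  #define NUM_LINES 1024

  struct line { int valid, dirty; uint32_t tag; };
  static struct line cache[NUM_LINES];

  static void flush_line(struct line *l) { (void)l;    /* write dirty data back */ }
  static void fetch_line(uint32_t addr)  { (void)addr; /* read the line in      */ }

  static void write_word(uint32_t addr) {
      struct line *l  = &cache[(addr / LINE_SIZE) % NUM_LINES];
      uint32_t    tag = addr / (LINE_SIZE * NUM_LINES);
      if (!l->valid || l->tag != tag) {   /* write miss */
          if (l->valid && l->dirty)
              flush_line(l);              /* write-back the evicted line */
          fetch_line(addr);               /* write-allocate: fetch, then */
          l->valid = 1;                   /* treat the write as a hit    */
          l->tag   = tag;
      }
      l->dirty = 1;   /* the write itself only dirties the cached line */
  }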
Modern memory usage
Object-oriented languages tend to create many small objects with short lifetimes. For example, the STL uses value semantics, which copies objects on every operation!
Functional languages (Lisp, Scheme) constantly create new objects that replace old ones (cons and friends…)
Modern memory usage contd.
Creation is expensive – an allocation means a probable write miss (a new address is used). The article cites sources claiming functional languages write in up to 25% of their instructions (others around 10%)
Memory Recycling Pattern
GC systems tend to violate locality assumptions
Cyclic reuse of memory defeats any caching policy – the reuse cycle is too long to be captured
GC systems become bandwidth limited
Allocation is to blame, not GC
The locality of the GC process itself is not "the weakest link"
The problem is fast allocation of memory which will be reclaimed much later
Main memory fills very fast. What to do?
1. GC – too frequent, but avoids paging
2. Use VM – touches many pages and causes paging
Pattern Results
Allocation touches new memory and forces a page-in/page fetch (slow)
Why fetch? The memory being allocated was used previously; the OS doesn't know it is garbage, and the allocation will overwrite it anyway
Informing the OS that no fetch is required speeds up execution (see the sketch below)
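A modern analogue of that advice, as a hedged sketch: on Linux an allocator can hand a reclaimed region to madvise, so evicting its pages needs no write-back and the next touch gets a zero-filled page instead of a disk fetch (region and len are illustrative parameters):

  #include <stddef.h>
  #include <sys/mman.h>

  /* After a collection, tell the OS that [region, region+len) holds
     only garbage: its pages can be dropped without being written out,
     and the next touch maps a fresh zero page – no page-in needed. */
  static void discard_garbage(void *region, size_t len) {
      madvise(region, len, MADV_DONTNEED);
  }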
Pattern Results contd.
When main memory is exhausted (or the process isn't allowed more pages), old pages must be evicted
Those pages are probably dirty – they must be written to disk
Even worse – the evicted LRU page is probably garbage!
Worst case: Disk B/W = 2 × Allocation Rate
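Where the factor of 2 comes from: at an allocation rate A, every recycled page costs one page-in (the OS fetches its old garbage contents just before they are overwritten) and, later, one page-out (it is evicted dirty), so

  \text{Disk B/W} = \underbrace{A}_{\text{page-ins before overwrite}} + \underbrace{A}_{\text{page-outs of dirty victims}} = 2A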
Another view
View the GC allocator as a co-process to the mutator
Each one has its own locality of reference
The mutator probably has good spatial locality
The allocator marches linearly through memory
Allocation is cyclic (remember LRU)
Compaction and Semi-Spaces
Compaction helps the mutator, but makes little difference to the allocator
It still marches through large memory areas
Trouble with semi-spaces – the tospace was probably evicted, and all addresses are replaced – a cache flush. The collector marches through the entire heap every second cycle
Solution?
So LRU is bad – can we replace it?
We can, but it won't help much
Too much memory is touched too frequently
Allocator page faults dominate program execution!
Only holding the entire reuse cycle in memory will stop paging
Generational GC
Solution: touch less memory, less frequently
Divide the heap into generations
GC the young generation(s) – touching less memory
This eliminates the vast memory marching – the memory reuse cycle is minimized
Eliminates paging – but what about the cache? (see the allocation sketch below)
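A minimal sketch of why this helps: allocation in the youngest generation is a bump pointer over a small fixed region, so the reuse cycle shrinks to that region's size. The 141 KB figure is borrowed from the experiments below; collect_nursery is a hypothetical stub:

  #include <stddef.h>

  #define NURSERY_SIZE (141 * 1024)   /* small, cache-sized region */

  static char  nursery[NURSERY_SIZE];
  static char *bump = nursery;

  static void collect_nursery(void) { /* copy survivors, promote (stub) */ }

  static void *gc_alloc(size_t n) {
      if (bump + n > nursery + NURSERY_SIZE) {
          collect_nursery();          /* touches only this small region  */
          bump = nursery;             /* then reuses the very same span  */
      }
      void *p = bump;                 /* still a linear march, but over  */
      bump += n;                      /* a region that can stay resident */
      return p;
  }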
Generational GC variations
Can use a single space – immediate promotion
Can use semi-spaces – promote at will, at the expense of more memory
Better Generational GC
Ungar: use a pair of semi-spaces plus a separate, dedicated creation space
The creation space is emptied and reused every cycle, while the semi-spaces alternate roles as the destination
The result: only a small part of the semi-spaces is touched, and new objects are created in a "hot" space in main memory (and maybe in cache) – see the sketch below
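A sketch of Ungar's three-space cycle; the names and the copy routine are illustrative:

  /* One creation space plus two survivor semi-spaces. */
  struct space { char *base; char *bump; };

  static struct space creation;      /* all new objects are born here */
  static struct space past, future;  /* survivors alternate between   */

  static void copy_live(struct space *from, struct space *to) {
      (void)from; (void)to;          /* copy reachable objects (stub) */
  }

  static void minor_collect(void) {
      copy_live(&creation, &future); /* survivors leave creation space   */
      copy_live(&past, &future);
      creation.bump = creation.base; /* creation space reused each cycle */
      struct space tmp = past;       /* semi-spaces swap roles           */
      past   = future;
      future = tmp;
  }

Only the live fraction of the semi-spaces is touched each cycle, while the creation space is rewritten constantly and therefore stays "hot".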
Cache Revised
Cache misses can be categorized into:
1. Capacity misses – the miss will occur no matter what cache is used
2. Conflict misses – the miss occurs because two (or more) addresses map to the same cache line (set)
Direct-mapped caches suffer from conflict misses the most – every miss evicts the block with the same mapping
Conflict Misses in-depth
The miss rate is roughly a minimum function of the access frequencies
Example: both addresses map to the same line; the first is accessed every ms, the second every μs. The (double) miss occurs every ms
The rate depends on the usage frequency of the addresses not in the cache (worked out below)
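In symbols, for two conflicting addresses accessed with frequencies f_1 and f_2, each access to the rarer address evicts the other block and is itself a miss, so with f_1 = 10^6/s (every μs) and f_2 = 10^3/s (every ms):

  \text{miss rate} \approx 2\min(f_1, f_2) = 2 \times 10^3 \text{ misses/s}

The fast address is touched a million times a second, yet the pair only misses two thousand times a second.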
Minimizing Conflict Misses Rate
Most non-GC systems are skewed – a few objects are accessed frequently, the rest rarely. If these are placed well, the cache is efficient
If many blocks are accessed on an intermediate time scale – more misses, and more chances they will interfere with each other
(Over-simplified to help understanding)
Example
A program marches through memory while doing normal activity. We use a 16 KB cache
2-way associative: the most frequent blocks are not evicted
Direct mapped: a total flush every cycle
Conclusion: it takes twice as long for a block to be remapped in DM, but the result is painful (a flush)
DM can't handle multiple access patterns
Experiments
An instrumented Scheme compiler with an integrated cache simulator
Executes millions of instructions, allocates MBs
We'll present 2 programs:
1. Scheme compiler
2. Boyer benchmark – objects live long and tend to be promoted
Experiments contd.
Cache lines are 16 bytes wide
3 collectors:
1. GGC with 2 MB spaces per generation – no promotion is ever done
2. GGC with 141 KB spaces per generation
3. #2 + a 141 KB creation space (Ungar)
Results (Capacity)
Interpretation
LRU queue distance distribution – what does it mean?
1. The probability of a block being touched at a given position in the LRU queue
2. The probability of a block being touched given how long since it was last touched
3. The probability of a block being touched given how many other blocks have been touched more recently (see the sketch below)
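A sketch of how such a distribution can be gathered from a trace of block accesses, using a move-to-front list; the names and the 4096-block cap are illustrative:

  #include <stdint.h>

  #define MAX_BLOCKS 4096

  static uint32_t queue[MAX_BLOCKS];    /* LRU queue, most recent first */
  static int      depth = 0;
  static long     hist[MAX_BLOCKS + 1]; /* hist[d] = touches at depth d */

  /* Record the queue position of `block`, then move it to the front.
     For a fully associative LRU cache of k blocks, the sum of
     hist[0..k-1] counts the hits; everything further right, misses. */
  static void touch(uint32_t block) {
      int pos = -1;
      for (int i = 0; i < depth; i++)
          if (queue[i] == block) { pos = i; break; }
      if (pos < 0) {                    /* first touch (or fell off end) */
          hist[depth]++;
          if (depth < MAX_BLOCKS) depth++;
          pos = depth - 1;
      } else {
          hist[pos]++;
      }
      for (int i = pos; i > 0; i--)     /* move to the front */
          queue[i] = queue[i - 1];
      queue[0] = block;
  }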
Interpretation contd.
Fourth queue position – 128 KB; eighth queue position – 256 KB
For any given position – the area under the curve to its left represents cache hits; to its right, misses
The curve's height at a point is the marginal increase in hits due to cache enlargement at that point
Experiment Meaning
The first entries absorb most hits
Collector #1: a dramatic drop; beyond about the tenth position (320 KB) no more cache is needed
Collectors #2+#3: a hump peaking when memory starts being recycled
Experiment Meaning contd.
#2: recycling after 141×2 KB; a cache of 300-400 KB should suffice
#3: the creation space is constantly recycled, and only a small part of the other spaces is touched; a cache of 200-300 KB should suffice
Experiment Meaning contd. 2
Boyer behaves differently
#3 is better than #2 by 30%
Capacity misses disappear if the cache is larger than the youngest generation
Results (Collision)
Interpretation
The graph plots cache size vs. miss rate
Shows results only for collector #3
Experiment Meaning
The associative caches show a dramatic, almost linear drop in misses up to 256 KB (enough to contain the youngest generation); from there on, nothing interesting
Direct mapped is the same on the 16-90 KB interval, better on 90-135 KB, and much worse later on
Experiment Meaning contd.
Why is DM better in that interval?
The cache is big enough to hold the creation area, and suffers interference only on other blocks
The associative cache evicts blocks before they are used, due to collisions
Later, the associative cache suffers only "re-fill" misses, while DM also suffers collisions
More Performance Notes
When the cache is too small, most evicted blocks are dirty and require expensive writebacks
Interference may also cause writebacks
Conclusions
Caches are an important part of modern computers
Garbage collectors reuse memory in cycles, often marching through memory
LRU evicts dirty pages/cache lines; needless fetches are costly
GGC reuses a smaller area and reduces paging
Conclusions contd.
A similar idea applies to caches: hold the youngest generation entirely
Ungar's 3-space proposal reduces the required footprint by 30%
Excluding a small interval, associative caches perform better than direct-mapped ones, which suffer collision misses
Questions?
The End