A Fully Associative, Tagless DRAM Cache
Yongjun Lee† Jongwon Kim† Hakbeom Jang† Hyunggyun Yang‡ Jangwoo Kim‡ Jinkyu Jeong† Jae W. Lee†
June 15th 2015
†Sungkyunkwan University ‡POSTECH
ISCA-42, Portland, Oregon
Die-stacked DRAM technologies
• Promising solution to the memory wall problem
− Interconnects logic and DRAM dies using TSVs (3D) and silicon interposers (2.5D)
− Adopted by major processor and DRAM vendors for throughput and energy efficiency
[Figure: processor die and in-package DRAM stacked on a silicon interposer]
Die-stacked DRAM technologies: How to use in-package DRAM?
• Two options for using it
− As a large DRAM cache (this work)
− As part of main memory
[Figure: processor die with in-package DRAM on a silicon interposer, alongside off-package DRAM]
Problems with tags (1): Storage/energy overhead
• Cache tags incur large storage and energy overheads
− On-die SRAM tags are expensive [MICRO’11] [MICRO’12]
− This overhead remains significant even with page-based caches
[Figure: size of SRAM tag (MB) vs. DRAM cache size, for a 64B block-based cache and a 4KB page-based cache]
Problems with tags (2): Latency overhead
• Cache tags increase cache access latency
− Two-step address translation: TLB (VA-to-PA), then cache tags (PA-to-CA)
− A larger DRAM cache leads to a longer PA-to-CA translation time
[Figure: Core to TLB (VA-to-PA) to L1/L2 to tags (PA-to-CA) to in-package DRAM; the tag lookup adds a latency penalty]
[Plot: tag lookup latency grows with DRAM cache size]
Our proposal: Tagless DRAM cache!
Our proposal: Tagless DRAM cache
• Key ideas
− No tags: eliminates the storage, energy, and latency overheads of tags
− Page-based caching: block size equals the OS page size (4KB)
− Cache-map TLB (cTLB): stores virtual-to-cache address (VA-to-CA) mappings (entry layout sketched after the figure)
− TLB miss handler: establishes a VA-to-CA mapping in a single step by consolidating the page table walk with cache block allocation
[Figure: the tag array is removed; the conventional TLB is replaced by a cache-map TLB that translates virtual addresses directly to cache addresses]
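To make the key idea concrete, here is a minimal C sketch (field names are hypothetical; the paper does not prescribe an encoding) contrasting a conventional TLB entry with a cache-map TLB entry:

```c
#include <stdint.h>
#include <stdbool.h>

/* Conventional TLB entry: virtual-to-physical (VA-to-PA) mapping. */
typedef struct {
    uint64_t vpn;    /* virtual page number */
    uint64_t pfn;    /* physical frame number */
} tlb_entry_t;

/* Cache-map TLB (cTLB) entry: virtual-to-cache (VA-to-CA) mapping.
 * A hit yields the in-package DRAM cache address directly, so no
 * tag lookup is needed on the hit path. */
typedef struct {
    uint64_t vpn;    /* virtual page number */
    uint64_t cfn;    /* cache frame number in the DRAM cache */
    bool nc;         /* non-cacheable: cfn holds a physical frame instead */
} ctlb_entry_t;
```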
Outline
• Motivation and Key Ideas
• Tagless DRAM Cache: Operations and Organization
− Cache Operations
− Cache Organization
− Summary
• Evaluation
• Conclusion
Cache operations (1): Three major components
• cTLB: VA-to-CA mappings (for a subset of cached blocks)
• Page table: VA-to-CA (cached blocks) or VA-to-PA (uncached blocks) mappings
• Global inverted page table (GIPT): CA-to-PA mappings (for all cached blocks); a dispatch sketch across these structures follows the figure
[Figure: processor pipeline with on-die cTLB and L1/L2 caches, the in-package die-stacked DRAM cache, and off-package DRAM holding the page table and GIPT]
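A minimal sketch of how a memory access might be dispatched across the three structures, previewing the five cases on the following slides. All function names are hypothetical stand-ins, not the paper's interfaces:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t va_t, ca_t, pa_t;

/* Hypothetical stand-ins for the three structures. */
bool ctlb_lookup(va_t va, ca_t *ca, bool *nc); /* on-die cTLB */
ca_t handle_ctlb_miss(va_t va);                /* page table walk, plus a
                                                  cache fill if needed (Cases 1 and 3) */
void dram_cache_read(ca_t ca, void *buf);      /* in-package DRAM cache */
void dram_read(pa_t pa, void *buf);            /* off-package DRAM */

/* Dispatch: a cTLB hit goes straight to the DRAM cache (Case 2) or,
 * for non-cacheable pages, bypasses it (Case 5); a cTLB miss invokes
 * the TLB miss handler (Cases 1 and 3). Eviction (Case 4) runs
 * asynchronously in the background, off this path. */
void mem_access(va_t va, void *buf) {
    ca_t ca;
    bool nc = false;
    if (!ctlb_lookup(va, &ca, &nc))
        ca = handle_ctlb_miss(va);   /* refills the cTLB as a side effect */
    if (nc)
        dram_read((pa_t)ca, buf);    /* entry holds a PA: bypass the cache */
    else
        dram_cache_read(ca, buf);
}
```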
Cache operations (2): Cache miss path
• Case 1: cTLB miss & cache miss
− TLB miss handler performs both the page table walk and the cache fill (handler sketch after the figure)
− cTLB holds the newly established VA-to-CA mapping
[Figure (miss path): (1) cTLB miss on VA1, (2) page table walk finds VA1→PA1, (3) a free cache block CA1 is allocated and the 4KB data block is copied from PA1, (4) the PTE is updated to VA1→CA1 and the GIPT records CA1→PA1, (5) the cTLB is updated with VA1→CA1]
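The numbered steps above might be sketched as follows, with one hypothetical helper per step (the free-block pool is maintained by the OS, as Case 4 explains later):

```c
#include <stdint.h>

typedef uint64_t va_t, pa_t, ca_t;

/* Hypothetical helpers, one per numbered step in the figure. */
pa_t page_walk(va_t va);                     /* step 2: page table walk */
ca_t free_block_pop(void);                   /* step 3: OS keeps free blocks ready */
void dram_cache_fill(ca_t ca, pa_t pa);      /* copy the 4KB block in-package */
void gipt_insert(ca_t ca, pa_t pa, va_t va); /* record CA-to-PA and PTE pointer */
void pte_set_cached(va_t va, ca_t ca);       /* step 4: PTE now maps VA to CA */
void ctlb_insert(va_t va, ca_t ca);          /* step 5: install VA-to-CA mapping */

/* Case 1 (cTLB miss & cache miss): the miss handler consolidates the
 * page table walk with cache block allocation, establishing the
 * VA-to-CA mapping in a single step. */
ca_t fill_on_miss(va_t va) {
    pa_t pa = page_walk(va);
    ca_t ca = free_block_pop();
    dram_cache_fill(ca, pa);
    gipt_insert(ca, pa, va);
    pte_set_cached(va, ca);
    ctlb_insert(va, ca);
    return ca;
}
```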
Cache operations (3): Cache hit path
• Case 2: cTLB hit & cache hit
− A TLB hit guarantees a cache hit
− The cTLB translates VA to CA directly, achieving the lowest hit latency
[Figure (hit path): (1) cTLB hit on VA1 yields CA1 directly; the 4KB data block is read from the DRAM cache at CA1]
Cache operations (4): Victim cache hit path
• Case 3: cTLB miss & cache hit
− Typically, the cTLB reach is smaller than the DRAM cache size
− DRAM cache space beyond the cTLB reach serves as a victim cache for blocks recently evicted from the cTLB
[Figure: the cTLB holds VA-to-CA mappings (VA1→CA1, VA4→CA4) for a subset of the page table's cached entries (VA0→CA0 through VA5→CA5); blocks in the DRAM cache but not in the cTLB are victim blocks]
What if VA2 is requested?
Cache operations (4): Victim cache hit path
• Case 3: cTLB miss & cache hit (cont’d)
− If the PTE already maps to a CA, the block is still cached (victim hit; see the sketch after the figure)
− No additional latency penalty
[Figure (victim hit path): (1) cTLB miss on VA2, (2) page table walk finds VA2→CA2, (3) the cTLB is updated with VA2→CA2; the 4KB data block is served from the DRAM cache at CA2]
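A sketch of how the miss handler might distinguish a victim hit from a genuine miss using the valid-in-cache (VC) flag. The PTE layout here is a simplified, hypothetical view:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t va_t, ca_t;

/* Simplified PTE view: if VC is set, the frame field holds a cache
 * address rather than a physical one. */
typedef struct {
    uint64_t frame;
    bool vc;                          /* valid-in-cache */
} pte_t;

pte_t *page_walk_pte(va_t va);        /* step 2: page table walk */
void   ctlb_insert(va_t va, ca_t ca); /* step 3: refill the cTLB */
ca_t   fill_on_miss(va_t va);         /* Case-1 path: allocate and fill */

/* Case 3 (cTLB miss & cache hit): the block was evicted only from the
 * cTLB, not from the DRAM cache, so it is reused in place with no
 * additional latency penalty. */
ca_t handle_ctlb_miss(va_t va) {
    pte_t *pte = page_walk_pte(va);
    if (pte->vc) {                    /* victim hit */
        ctlb_insert(va, pte->frame);  /* frame is already a CA */
        return pte->frame;
    }
    return fill_on_miss(va);          /* genuine miss: Case 1 */
}
```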
Cache operations (5): Cache block eviction
• Case 4: Asynchronous cache block eviction
− The OS ensures that a small number of free blocks is always available
− Takes cache block writebacks off the common access path (see the sketch after the figure)
[Figure (eviction path): (1) the PTE is locked using the GIPT's PTE pointer, (2) the 4KB data block at CA5 is written back to PA5 or dropped, (3) the PTE is updated to VA5→PA5 and unlocked; CA5 returns to the free pool]
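A sketch of the asynchronous eviction step under the same hypothetical structures; the pending-update (PU) flag serves as the per-PTE lock of step 1:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t va_t, pa_t, ca_t;

typedef struct {
    uint64_t frame;
    bool vc, pu;                    /* valid-in-cache, pending-update */
} pte_t;

typedef struct {
    pa_t   ppn;                     /* CA-to-PA mapping */
    pte_t *ptep;                    /* pointer back to the PTE */
    uint32_t tlb_resident;          /* which TLBs hold this page */
} gipt_entry_t;

ca_t gipt_pick_victim(gipt_entry_t **out); /* FIFO among blocks resident in
                                              no TLB (avoids TLB shootdowns) */
bool block_is_dirty(ca_t ca);
void write_back(ca_t ca, pa_t pa);         /* copy the 4KB block off-package */
void free_block_push(ca_t ca);

/* Case 4: a background task keeps the free-block pool stocked so
 * eviction stays off the common access path. */
void evict_one_block(void) {
    gipt_entry_t *g;
    ca_t ca = gipt_pick_victim(&g);
    g->ptep->pu = true;                    /* 1: lock the PTE via its pointer */
    if (block_is_dirty(ca))
        write_back(ca, g->ppn);            /* 2: write back, else drop */
    g->ptep->frame = g->ppn;               /* 3: PTE maps VA to PA again */
    g->ptep->vc = false;
    g->ptep->pu = false;                   /* unlock */
    free_block_push(ca);
}
```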
Cache operations (6): Non-cacheable block access
• Case 5: cTLB hit & cache miss
− Occurs only for pages declared non-cacheable (e.g., via mmap)
− The cTLB entry holds a VA-to-PA mapping, and the memory request bypasses the DRAM cache (see the sketch after the figure)
[Figure (bypass path): (1) cTLB hit on VA3 yields PA3; the 64B data block is served directly from off-package DRAM at PA3]
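A sketch of the bypass path; note that non-cacheable accesses stay at 64B granularity rather than fetching a whole 4KB page. Types and names are hypothetical:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t pa_t;

typedef struct {
    uint64_t frame;   /* page-aligned PA when nc is set */
    bool nc;          /* non-cacheable */
} ctlb_entry_t;

void dram_read_line(pa_t pa, void *buf);  /* 64B read from off-package DRAM */

/* Case 5 (cTLB hit & cache miss): for pages declared non-cacheable,
 * the cTLB entry holds a VA-to-PA mapping and the request bypasses
 * the DRAM cache entirely. */
void access_noncacheable(const ctlb_entry_t *e, uint64_t page_off, void *buf) {
    if (e->nc)
        dram_read_line(e->frame + page_off, buf);
}
```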
Outline
• Motivation and Key Ideas
• Tagless DRAM Cache: Operations and Organization
− Cache Operations
− Cache Organization
− Summary
• Evaluation
• Conclusion
Cache organization (1): Page table entry (PTE)
• PTE has three additional flags
− Valid-in-cache (VC): indicates whether the page is currently cached
− Non-cacheable (NC): indicates whether the page bypasses the DRAM cache
− Pending update (PU): indicates that the PTE is about to be updated (flag encodings sketched after the figure)
[Figure: page table indexed by VA; each PTE holds a physical page number or a cache page number, plus flags]
VC: 0 = not cached yet, 1 = cached in DRAM cache
NC: 0 = cacheable, 1 = non-cacheable (bypasses the DRAM cache)
PU: 0 = ready, 1 = busy (PTE being updated)
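One way the three flags could be encoded; the slides do not specify bit positions, so these are illustrative:

```c
#include <stdint.h>

/* Illustrative bit positions for the three added PTE flags. */
#define PTE_VC (1ull << 0)  /* valid-in-cache: frame field holds a CA */
#define PTE_NC (1ull << 1)  /* non-cacheable: requests bypass the cache */
#define PTE_PU (1ull << 2)  /* pending update: PTE locked while updated */

static inline int pte_is_cached(uint64_t pte) { return (pte & PTE_VC) != 0; }
static inline int pte_bypasses(uint64_t pte)  { return (pte & PTE_NC) != 0; }
static inline int pte_is_busy(uint64_t pte)   { return (pte & PTE_PU) != 0; }
```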
Cache organization (2): GIPT entry
• GIPT entry has three fields
− Physical page number (PPN): maintains the CA-to-PA mapping
− Page table entry pointer (PTEP): points to the corresponding PTE
− TLB residence bit vector: indicates which cores hold this page in their TLBs
• The eviction victim is selected among blocks not residing in any TLB (entry layout sketched after the figure)
− Prevents cache block eviction from triggering TLB shootdowns
− FIFO policy (by default)
[Figure: GIPT indexed by CA; each entry holds a physical page number (36 bits), a page table entry pointer (42 bits), and a TLB residence bit vector (one bit per TLB)]
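A sketch of a GIPT entry with the field widths given above, plus the storage arithmetic behind the ~20MB figure on the next slide. The layout is hypothetical (bitfields on uint64_t are compiler-dependent) and assumes the 4-core evaluated system:

```c
#include <stdint.h>

#define NUM_TLBS 4   /* one residence bit per TLB; 4 cores in the evaluated system */

/* GIPT entry, indexed by cache address (so the CA itself is implicit). */
typedef struct {
    uint64_t ppn  : 36;               /* physical page number: CA-to-PA */
    uint64_t ptep : 42;               /* pointer to the corresponding PTE */
    uint64_t tlb_resident : NUM_TLBS; /* which TLBs hold this page */
} gipt_entry_t;

/* Storage check: an 8GB cache with 4KB blocks has 8GB / 4KB = 2^21
 * entries; at 36 + 42 + 4 = 82 bits (~10.25B) each, the table is
 * roughly 21MB, in line with the ~20MB off-package figure quoted on
 * the next slide. */
```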
Cache organization (3): GIPT vs. cache tag array
• The tagless DRAM cache eliminates tags, but introduces the GIPT
− The GIPT has a much lower cost than a cache tag array
                    Cache tag array                          GIPT
Access frequency    At every cache access (hit and miss)     At cache misses only
Storage overhead    16MB on-die SRAM [Baseline]              20MB off-package DRAM [This work]
(8GB DRAM cache)    48MB on-die SRAM [ISCA’13]
                    256MB in-package DRAM [MICRO’14]
[MICRO’14] Jevdjic et al., “Unison cache: A scalable and effective die-stacked DRAM cache”
[ISCA’13] Jevdjic et al., “Die-stacked DRAM caches for servers: Hit ratio, latency, or bandwidth? Have it all with footprint cache”
Summary (1): Latency breakdown
• Tagless cache vs. SRAM-tag cache: Latency comparison
[Figure: latency bars, drawn not to scale]
Hit latency (common case): SRAM-tag = TLB + on-die cache + SRAM tag; Tagless = cTLB + on-die cache (SRAM tag lookup saved)
Miss latency (rare case): SRAM-tag = TLB + on-die cache + SRAM tag + PT walks; Tagless = cTLB + on-die cache + PT walks + GIPT update (penalty: two memory accesses)
• Reduces hit time at the cost of a slight increase in miss penalty
Summary (2): Comparison of different cache designs
• Design requirements for large DRAM cache
• Over-fetching problem can be mitigated by adopting:
− Singleton prediction [MICRO‘14]
− Hot/cold page tracking [HPCA’10]
Design requirement              Block-based (64B/block)   Page-based (4KB/block)   This work (4KB/block)
Small tag storage               Bad                       Good                     Best
High hit ratio                  Bad                       Good                     Best
Low hit latency                 Bad                       Good                     Best
High DRAM row buffer locality   Bad                       Good                     Good
Minimal over-fetching           Good                      Bad                      Bad
[MICRO’14] Jevdjic et al., “Unison cache: A scalable and effective die-stacked DRAM cache”
[HPCA’10] Jiang et al., “CHOP: Adaptive filter-based DRAM caching for CMP server platforms”
Outline
• Motivation and Key Ideas
• Tagless DRAM Cache: Operations and Organization
• Evaluation
− Methodology
− Single-programmed workloads
− Multi-programmed workloads
• Conclusion
Evaluation methodology (1)
• Simulator: McSimA+ [ISPASS’13]
• DRAM timing and energy parameters [SC’14]
Component                 Parameters
CPU                       Out-of-order, 4 cores, 3GHz
L1/L2 TLB                 32I/32D entries per core / 512 entries per core
L1/L2 cache               4-way, 32KB/32KB per core / 16-way, 2MB per core
SRAM-tag                  16-way, 256K entries, 11 cycles
In-package DRAM (1GB)     DDR 3.2GHz, 1 channel, 2 ranks, 16 banks per rank, 128 bits per channel
Off-package DRAM (8GB)    DDR 1.6GHz, 1 channel, 2 ranks, 64 banks per rank, 64 bits per channel

Parameter                            DRAM cache   Off-package DRAM
I/O energy                           2.4pJ/b      20pJ/b
RD or WR energy without I/O          4pJ/b        13pJ/b
ACT+PRE energy (4KB page)            15nJ         15nJ
Activate-to-read delay (tRCD)        8ns          14ns
Read-to-first-data delay (tAA)       10ns         14ns
Activate-to-precharge delay (tRAS)   22ns         35ns
Precharge command period (tRP)       14ns         14ns
[ISPASS’13] Ahn et al., “McSimA+: A manycore simulator with application-level+ simulation and detailed microarchitecture modeling”
[SC’14] Son et al., “Microbank: Architecting through-silicon interposer-based main memory systems”
Evaluation methodology (2)
• Workloads
− Single-programmed workloads: SPEC CPU 2006
− Multi-programmed workloads: SPEC CPU 2006
− Multi-threaded workloads: PARSEC (not shown)
• Comparison of five designs
− No L3 cache (No L3): Baseline system
− Bank-Interleaving (BI): Use 3D DRAM as part of memory (bank-interleaved)
− SRAM-tag (SRAM): Page-based DRAM cache with SRAM tags
− Tagless (cTLB): Our proposed tagless DRAM cache
− Ideal: Ideal DRAM cache model
Single-programmed workloads: IPC and EDP
• IPC improvements: 16.4% over No L3, 7.3% over SRAM-tag
• EDP improvements: 35.8% over No L3, 26.5% over SRAM-tag
Geometric mean of CPU2006 benchmarks (11 workloads)
[410.bwaves, 429.mcf, 433.milc, 437.leslie3d, 450.soplex, 459.GemsFDTD, 462.libquantum, 470.lbm, 471.omnetpp, 482.sphinx3, 483.xalancbmk]
Single-programmed workloads: Average L3 latency
• Up to 16% lower than SRAM-tag (462.libquantum)
− Average reduction of 10%
[Figure: average L3 latency per benchmark with geomean; hit rate ~99%, approaching the raw in-package DRAM access time]
Multi-programmed workloads
• IPC improvements: 24.9% over No L3, 2.6% over SRAM-tag
• EDP improvements: 41.5% over No L3, 21.3% over SRAM-tag
Geometric mean of CPU2006 benchmarks (8 groups)
Summary
• We introduce the tagless DRAM cache featuring:
− Lowest hit latency, with no tag checking on the hit path
− Low average memory access time, due to low hit latency and full associativity
− Zero tag storage overhead in either on-die SRAM or in-package DRAM
− High energy efficiency, with no energy wasted on tags
− Flexibility to plug in a new caching policy by simply modifying the TLB miss handler
• The proposed tagless DRAM cache effectively eliminates one of the most serious scalability bottlenecks for future multi-gigabyte DRAM caches.
Thank you
More in paper
• Detailed configuration
• Non-cacheable flag
− Minimizes the over-fetching problem
− Shared page support
− Superpages
• TLB residence bit vector
− Reduces TLB shootdowns
• Multi-socket support
• More Results