A Fully Associative, Tagless DRAM Cache
Yongjun Lee† Jongwon Kim† Hakbeom Jang† Hyunggyun Yang‡ Jangwoo Kim‡ Jinkyu Jeong† Jae W. Lee†
June 15th 2015
†Sungkyunkwan University ‡POSTECH
ISCA-42, Portland, Oregon
Die-stacked DRAM technologies
• Promising solution to the memory wall problem
− Interconnects logic and DRAM dies using TSVs (3D) and silicon interposers (2.5D)
− Adopted by major processor and DRAM vendors for throughput and energy efficiency
[Figure: processor die and in-package DRAM stacked on a silicon interposer]
Die-stacked DRAM technologies: How to use in-package DRAM?
• Two options for using it
− As a large DRAM cache (this work)
− As part of main memory
[Figure: processor die with in-package DRAM on a silicon interposer, alongside off-package DRAM]
Problems with tags (1): Storage/energy overhead
• Cache tags incur large storage and energy overheads
− On-die SRAM tags are expensive [MICRO’11] [MICRO’12]
− This overhead remains significant even with page-based caches
[Figure: size of SRAM tag (MB) vs. DRAM cache size, for a 64B block-based cache and a 4KB page-based cache]
Problems with tags (2): Latency overhead
• Cache tags increase cache access latency
− Two-step address translation: TLB (VA-to-PA), then cache tags (PA-to-CA)
− A larger DRAM cache leads to a longer PA-to-CA translation time
[Figure: Core to TLB (VA-to-PA) to L1/L2 to tags (PA-to-CA) to in-package DRAM; the tag lookup adds a latency penalty]
[Plot: tag lookup latency grows with DRAM cache size]
Our proposal: Tagless DRAM cache!
Our proposal: Tagless DRAM cache
• Key ideas
− No tags: eliminates the storage, energy, and latency overheads of tags
− Page-based caching: block size equals the OS page size (4KB)
− Cache-map TLB (cTLB): stores virtual-to-cache address (VA-to-CA) mappings (entry layout sketched after the figure)
− TLB miss handler: establishes a VA-to-CA mapping in a single step by consolidating the page table walk with cache block allocation
[Figure: the tag array is removed; the conventional TLB is replaced by a cache-map TLB that translates virtual addresses directly to cache addresses]
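To make the key idea concrete, here is a minimal C sketch (field names are hypothetical; the paper does not prescribe an encoding) contrasting a conventional TLB entry with a cache-map TLB entry:

```c
#include <stdint.h>
#include <stdbool.h>

/* Conventional TLB entry: virtual-to-physical (VA-to-PA) mapping. */
typedef struct {
    uint64_t vpn;    /* virtual page number */
    uint64_t pfn;    /* physical frame number */
} tlb_entry_t;

/* Cache-map TLB (cTLB) entry: virtual-to-cache (VA-to-CA) mapping.
 * A hit yields the in-package DRAM cache address directly, so no
 * tag lookup is needed on the hit path. */
typedef struct {
    uint64_t vpn;    /* virtual page number */
    uint64_t cfn;    /* cache frame number in the DRAM cache */
    bool nc;         /* non-cacheable: cfn holds a physical frame instead */
} ctlb_entry_t;
```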
Outline
• Motivation and Key Ideas
• Tagless DRAM Cache: Operations and Organization
− Cache Operations
− Cache Organization
− Summary
• Evaluation
• Conclusion
Cache operations (1): Three major components
• cTLB: VA-to-CA mappings (for a subset of cached blocks)
• Page table: VA-to-CA (cached blocks) or VA-to-PA (uncached blocks) mappings
• Global inverted page table (GIPT): CA-to-PA mappings (for all cached blocks); a dispatch sketch across these structures follows the figure
[Figure: processor pipeline with on-die cTLB and L1/L2 caches, the in-package die-stacked DRAM cache, and off-package DRAM holding the page table and GIPT]
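A minimal sketch of how a memory access might be dispatched across the three structures, previewing the five cases on the following slides. All function names are hypothetical stand-ins, not the paper's interfaces:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t va_t, ca_t, pa_t;

/* Hypothetical stand-ins for the three structures. */
bool ctlb_lookup(va_t va, ca_t *ca, bool *nc); /* on-die cTLB */
ca_t handle_ctlb_miss(va_t va);                /* page table walk, plus a
                                                  cache fill if needed (Cases 1 and 3) */
void dram_cache_read(ca_t ca, void *buf);      /* in-package DRAM cache */
void dram_read(pa_t pa, void *buf);            /* off-package DRAM */

/* Dispatch: a cTLB hit goes straight to the DRAM cache (Case 2) or,
 * for non-cacheable pages, bypasses it (Case 5); a cTLB miss invokes
 * the TLB miss handler (Cases 1 and 3). Eviction (Case 4) runs
 * asynchronously in the background, off this path. */
void mem_access(va_t va, void *buf) {
    ca_t ca;
    bool nc = false;
    if (!ctlb_lookup(va, &ca, &nc))
        ca = handle_ctlb_miss(va);   /* refills the cTLB as a side effect */
    if (nc)
        dram_read((pa_t)ca, buf);    /* entry holds a PA: bypass the cache */
    else
        dram_cache_read(ca, buf);
}
```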
Cache operations (2): Cache miss path
• Case 1: cTLB miss & cache miss
− TLB miss handler performs both the page table walk and the cache fill (handler sketch after the figure)
− cTLB holds the newly established VA-to-CA mapping
[Figure (miss path): (1) cTLB miss on VA1, (2) page table walk finds VA1→PA1, (3) a free cache block CA1 is allocated and the 4KB data block is copied from PA1, (4) the PTE is updated to VA1→CA1 and the GIPT records CA1→PA1, (5) the cTLB is updated with VA1→CA1]
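The numbered steps above might be sketched as follows, with one hypothetical helper per step (the free-block pool is maintained by the OS, as Case 4 explains later):

```c
#include <stdint.h>

typedef uint64_t va_t, pa_t, ca_t;

/* Hypothetical helpers, one per numbered step in the figure. */
pa_t page_walk(va_t va);                     /* step 2: page table walk */
ca_t free_block_pop(void);                   /* step 3: OS keeps free blocks ready */
void dram_cache_fill(ca_t ca, pa_t pa);      /* copy the 4KB block in-package */
void gipt_insert(ca_t ca, pa_t pa, va_t va); /* record CA-to-PA and PTE pointer */
void pte_set_cached(va_t va, ca_t ca);       /* step 4: PTE now maps VA to CA */
void ctlb_insert(va_t va, ca_t ca);          /* step 5: install VA-to-CA mapping */

/* Case 1 (cTLB miss & cache miss): the miss handler consolidates the
 * page table walk with cache block allocation, establishing the
 * VA-to-CA mapping in a single step. */
ca_t fill_on_miss(va_t va) {
    pa_t pa = page_walk(va);
    ca_t ca = free_block_pop();
    dram_cache_fill(ca, pa);
    gipt_insert(ca, pa, va);
    pte_set_cached(va, ca);
    ctlb_insert(va, ca);
    return ca;
}
```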
Cache operations (3): Cache hit path
• Case 2: cTLB hit & cache hit
− A TLB hit guarantees a cache hit
− The cTLB translates VA to CA directly, achieving the lowest hit latency
[Figure (hit path): (1) cTLB hit on VA1 yields CA1 directly; the 4KB data block is read from the DRAM cache at CA1]
Cache operations (4): Victim cache hit path
• Case 3: cTLB miss & cache hit
− Typically, the cTLB reach is smaller than the DRAM cache size
− DRAM cache space beyond the cTLB reach serves as a victim cache for blocks recently evicted from the cTLB
[Figure: the cTLB holds VA-to-CA mappings (VA1→CA1, VA4→CA4) for a subset of the page table's cached entries (VA0→CA0 through VA5→CA5); blocks in the DRAM cache but not in the cTLB are victim blocks]
What if VA2 is requested?
Cache operations (4): Victim cache hit path
• Case 3: cTLB miss & cache hit (cont’d)
− If the PTE already maps to a CA, the block is still cached (victim hit; see the sketch after the figure)
− No additional latency penalty
[Figure (victim hit path): (1) cTLB miss on VA2, (2) page table walk finds VA2→CA2, (3) the cTLB is updated with VA2→CA2; the 4KB data block is served from the DRAM cache at CA2]
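A sketch of how the miss handler might distinguish a victim hit from a genuine miss using the valid-in-cache (VC) flag. The PTE layout here is a simplified, hypothetical view:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t va_t, ca_t;

/* Simplified PTE view: if VC is set, the frame field holds a cache
 * address rather than a physical one. */
typedef struct {
    uint64_t frame;
    bool vc;                          /* valid-in-cache */
} pte_t;

pte_t *page_walk_pte(va_t va);        /* step 2: page table walk */
void   ctlb_insert(va_t va, ca_t ca); /* step 3: refill the cTLB */
ca_t   fill_on_miss(va_t va);         /* Case-1 path: allocate and fill */

/* Case 3 (cTLB miss & cache hit): the block was evicted only from the
 * cTLB, not from the DRAM cache, so it is reused in place with no
 * additional latency penalty. */
ca_t handle_ctlb_miss(va_t va) {
    pte_t *pte = page_walk_pte(va);
    if (pte->vc) {                    /* victim hit */
        ctlb_insert(va, pte->frame);  /* frame is already a CA */
        return pte->frame;
    }
    return fill_on_miss(va);          /* genuine miss: Case 1 */
}
```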
Cache operations (5): Cache block eviction
• Case 4: Asynchronous cache block eviction
− The OS ensures that a small number of free blocks is always available
− Takes cache block writebacks off the common access path (see the sketch after the figure)
[Figure (eviction path): (1) the PTE is locked using the GIPT's PTE pointer, (2) the 4KB data block at CA5 is written back to PA5 or dropped, (3) the PTE is updated to VA5→PA5 and unlocked; CA5 returns to the free pool]
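A sketch of the asynchronous eviction step under the same hypothetical structures; the pending-update (PU) flag serves as the per-PTE lock of step 1:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t va_t, pa_t, ca_t;

typedef struct {
    uint64_t frame;
    bool vc, pu;                    /* valid-in-cache, pending-update */
} pte_t;

typedef struct {
    pa_t   ppn;                     /* CA-to-PA mapping */
    pte_t *ptep;                    /* pointer back to the PTE */
    uint32_t tlb_resident;          /* which TLBs hold this page */
} gipt_entry_t;

ca_t gipt_pick_victim(gipt_entry_t **out); /* FIFO among blocks resident in
                                              no TLB (avoids TLB shootdowns) */
bool block_is_dirty(ca_t ca);
void write_back(ca_t ca, pa_t pa);         /* copy the 4KB block off-package */
void free_block_push(ca_t ca);

/* Case 4: a background task keeps the free-block pool stocked so
 * eviction stays off the common access path. */
void evict_one_block(void) {
    gipt_entry_t *g;
    ca_t ca = gipt_pick_victim(&g);
    g->ptep->pu = true;                    /* 1: lock the PTE via its pointer */
    if (block_is_dirty(ca))
        write_back(ca, g->ppn);            /* 2: write back, else drop */
    g->ptep->frame = g->ppn;               /* 3: PTE maps VA to PA again */
    g->ptep->vc = false;
    g->ptep->pu = false;                   /* unlock */
    free_block_push(ca);
}
```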
Cache operations (6): Non-cacheable block access
• Case 5: cTLB hit & cache miss
− Occurs only for pages declared non-cacheable (e.g., via mmap)
− The cTLB entry holds a VA-to-PA mapping, and the memory request bypasses the DRAM cache (see the sketch after the figure)
[Figure (bypass path): (1) cTLB hit on VA3 yields PA3; the 64B data block is served directly from off-package DRAM at PA3]
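A sketch of the bypass path; note that non-cacheable accesses stay at 64B granularity rather than fetching a whole 4KB page. Types and names are hypothetical:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t pa_t;

typedef struct {
    uint64_t frame;   /* page-aligned PA when nc is set */
    bool nc;          /* non-cacheable */
} ctlb_entry_t;

void dram_read_line(pa_t pa, void *buf);  /* 64B read from off-package DRAM */

/* Case 5 (cTLB hit & cache miss): for pages declared non-cacheable,
 * the cTLB entry holds a VA-to-PA mapping and the request bypasses
 * the DRAM cache entirely. */
void access_noncacheable(const ctlb_entry_t *e, uint64_t page_off, void *buf) {
    if (e->nc)
        dram_read_line(e->frame + page_off, buf);
}
```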
Outline
• Motivation and Key Ideas
• Tagless DRAM Cache: Operations and Organization
− Cache Operations
− Cache Organization
− Summary
• Evaluation
• Conclusion
Cache organization (1): Page table entry (PTE)
• PTE has three additional flags
− Valid-in-cache (VC): indicates whether the page is currently cached
− Non-cacheable (NC): indicates whether the page bypasses the DRAM cache
− Pending update (PU): indicates that the PTE is about to be updated (flag encodings sketched after the figure)
[Figure: page table indexed by VA; each PTE holds a physical page number or a cache page number, plus flags]
VC: 0 = not cached yet, 1 = cached in DRAM cache
NC: 0 = cacheable, 1 = non-cacheable (bypasses the DRAM cache)
PU: 0 = ready, 1 = busy (PTE being updated)
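One way the three flags could be encoded; the slides do not specify bit positions, so these are illustrative:

```c
#include <stdint.h>

/* Illustrative bit positions for the three added PTE flags. */
#define PTE_VC (1ull << 0)  /* valid-in-cache: frame field holds a CA */
#define PTE_NC (1ull << 1)  /* non-cacheable: requests bypass the cache */
#define PTE_PU (1ull << 2)  /* pending update: PTE locked while updated */

static inline int pte_is_cached(uint64_t pte) { return (pte & PTE_VC) != 0; }
static inline int pte_bypasses(uint64_t pte)  { return (pte & PTE_NC) != 0; }
static inline int pte_is_busy(uint64_t pte)   { return (pte & PTE_PU) != 0; }
```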
Cache organization (2): GIPT entry
• GIPT entry has three fields
− Physical page number (PPN): maintains the CA-to-PA mapping
− Page table entry pointer (PTEP): points to the corresponding PTE
− TLB residence bit vector: indicates which cores hold this page in their TLBs
• The eviction victim is selected among blocks not residing in any TLB (entry layout sketched after the figure)
− Prevents cache block eviction from triggering TLB shootdowns
− FIFO policy (by default)
[Figure: GIPT indexed by CA; each entry holds a physical page number (36 bits), a page table entry pointer (42 bits), and a TLB residence bit vector (one bit per TLB)]
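A sketch of a GIPT entry with the field widths given above, plus the storage arithmetic behind the ~20MB figure on the next slide. The layout is hypothetical (bitfields on uint64_t are compiler-dependent) and assumes the 4-core evaluated system:

```c
#include <stdint.h>

#define NUM_TLBS 4   /* one residence bit per TLB; 4 cores in the evaluated system */

/* GIPT entry, indexed by cache address (so the CA itself is implicit). */
typedef struct {
    uint64_t ppn  : 36;               /* physical page number: CA-to-PA */
    uint64_t ptep : 42;               /* pointer to the corresponding PTE */
    uint64_t tlb_resident : NUM_TLBS; /* which TLBs hold this page */
} gipt_entry_t;

/* Storage check: an 8GB cache with 4KB blocks has 8GB / 4KB = 2^21
 * entries; at 36 + 42 + 4 = 82 bits (~10.25B) each, the table is
 * roughly 21MB, in line with the ~20MB off-package figure quoted on
 * the next slide. */
```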
Cache organization (3): GIPT vs. cache tag array
• The tagless DRAM cache eliminates tags, but introduces the GIPT
− The GIPT has a much lower cost than a cache tag array
                    Cache tag array                          GIPT
Access frequency    At every cache access (hit and miss)     At cache misses only
Storage overhead    16MB on-die SRAM [Baseline]              20MB off-package DRAM [This work]
(8GB DRAM cache)    48MB on-die SRAM [ISCA’13]
                    256MB in-package DRAM [MICRO’14]
[MICRO’14] Jevdjic et al., “Unison cache: A scalable and effective die-stacked DRAM cache”
[ISCA’13] Jevdjic et al., “Die-stacked DRAM caches for servers: Hit ratio, latency, or bandwidth? Have it all with footprint cache”
Summary (1): Latency breakdown
• Tagless cache vs. SRAM-tag cache: Latency comparison
[Figure: latency bars, drawn not to scale]
Hit latency (common case): SRAM-tag = TLB + on-die cache + SRAM tag; Tagless = cTLB + on-die cache (SRAM tag lookup saved)
Miss latency (rare case): SRAM-tag = TLB + on-die cache + SRAM tag + PT walks; Tagless = cTLB + on-die cache + PT walks + GIPT update (penalty: two memory accesses)
• Reduces hit time at the cost of a slight increase in miss penalty
Summary (2): Comparison of different cache designs
• Design requirements for large DRAM cache
• Over-fetching problem can be mitigated by adopting:
− Singleton prediction [MICRO‘14]
− Hot/cold page tracking [HPCA’10]
Design requirement              Block-based (64B/block)   Page-based (4KB/block)   This work (4KB/block)
Small tag storage               Bad                       Good                     Best
High hit ratio                  Bad                       Good                     Best
Low hit latency                 Bad                       Good                     Best
High DRAM row buffer locality   Bad                       Good                     Good
Minimal over-fetching           Good                      Bad                      Bad
[MICRO’14] Jevdjic et al., “Unison cache: A scalable and effective die-stacked DRAM cache”
[HPCA’10] Jiang et al., “CHOP: Adaptive filter-based DRAM caching for CMP server platforms”
Outline
• Motivation and Key Ideas
• Tagless DRAM Cache: Operations and Organization
• Evaluation
− Methodology
− Single-programmed workloads
− Multi-programmed workloads
• Conclusion
Evaluation methodology (1)
• Simulator: McSimA+ [ISPASS’13]
• DRAM timing and energy parameters [SC’14]
Component                 Parameters
CPU                       Out-of-order, 4 cores, 3GHz
L1/L2 TLB                 32I/32D entries per core / 512 entries per core
L1/L2 cache               4-way, 32KB/32KB per core / 16-way, 2MB per core
SRAM-tag                  16-way, 256K entries, 11 cycles
In-package DRAM (1GB)     DDR 3.2GHz, 1 channel, 2 ranks, 16 banks per rank, 128 bits per channel
Off-package DRAM (8GB)    DDR 1.6GHz, 1 channel, 2 ranks, 64 banks per rank, 64 bits per channel

Parameter                            DRAM cache   Off-package DRAM
I/O energy                           2.4pJ/b      20pJ/b
RD or WR energy without I/O          4pJ/b        13pJ/b
ACT+PRE energy (4KB page)            15nJ         15nJ
Activate-to-read delay (tRCD)        8ns          14ns
Read-to-first-data delay (tAA)       10ns         14ns
Activate-to-precharge delay (tRAS)   22ns         35ns
Precharge command period (tRP)       14ns         14ns
[ISPASS’13] Ahn et al., “McSimA+: A manycore simulator with application-level+ simulation and detailed microarchitecture modeling”
[SC’14] Son et al., “Microbank: Architecting through-silicon interposer-based main memory systems”
Evaluation methodology (2)
• Workloads
− Single-programmed workloads: SPEC CPU 2006
− Multi-programmed workloads: SPEC CPU 2006
− Multi-threaded workloads: PARSEC (not shown)
• Comparison of five designs
− No L3 cache (No L3): Baseline system
− Bank-Interleaving (BI): Use 3D DRAM as part of memory (bank-interleaved)
− SRAM-tag (SRAM): Page-based DRAM cache with SRAM tags
− Tagless (cTLB): Our proposed tagless DRAM cache
− Ideal: Ideal DRAM cache model
Single-programmed workloads: IPC and EDP
• IPC improvements: 16.4% over No L3, 7.3% over SRAM-tag
• EDP improvements: 35.8% over No L3, 26.5% over SRAM-tag
Geometric mean of CPU2006 benchmarks (11 workloads)
[410.bwaves, 429.mcf, 433.milc, 437.leslie3d, 450.soplex, 459.GemsFDTD, 462.libquantum, 470.lbm, 471.omnetpp, 482.sphinx3, 483.xalancbmk]
Single-programmed workloads: Average L3 latency
• Up to 16% lower than SRAM-tag (462.libquantum)
− Average reduction of 10%
[Figure: average L3 latency per benchmark with geomean; hit rate ~99%, approaching the raw in-package DRAM access time]
Multi-programmed workloads
• IPC improvements: 24.9% over No L3, 2.6% over SRAM-tag
• EDP improvements: 41.5% over No L3, 21.3% over SRAM-tag
Geometric mean of CPU2006 benchmarks (8 groups)
Summary
• We introduce the tagless DRAM cache featuring:
− Lowest hit latency, with no tag checking on the hit path
− Low average memory access time, due to low hit latency and full associativity
− Zero tag storage overhead in either on-die SRAM or in-package DRAM
− High energy efficiency, with no energy wasted on tags
− Flexibility to plug in a new caching policy by simply modifying the TLB miss handler
• The proposed tagless DRAM cache effectively eliminates one of the most serious scalability bottlenecks for future multi-gigabyte DRAM caches.
Thank you
More in paper
• Detailed configuration
• Non-cacheable flag
− Minimizes the over-fetching problem
− Shared page support
− Superpages
• TLB residence bit vector
− Reduces TLB shootdowns
• Multi-socket support
• More Results