Scalable Cache Miss Handling For High MLP
James Tuck, Luis Ceze, and Josep Torrellas
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu



Page 1:

Scalable Cache Miss Handling For High MLP

James Tuck, Luis Ceze, and Josep Torrellas
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu

Page 2:

The ACM 39th International Symposium on Microarchitecture James Tuck 2 of 25

Introduction

• Checkpointed processors are promising superscalar architectures
  - Runahead, CPR, Out-of-order commit, CFP, CAVA

• Deliver high numbers of in-flight instructions
  - Effectively hide long memory latencies
  - Dramatically increase Memory-Level Parallelism (MLP)

Current miss handling structures are woefully under-designed!

Page 3:


Miss Handling Architecture (MHA)

[Diagram: the Core issues a cache miss to the L1 Cache; the MHA's MSHR file, made of entries and subentries, tracks the miss while the request goes to the rest of the cache hierarchy]

• MSHR = Miss Information/Status Holding Registers
• A primary miss to a line allocates an MSHR entry; a secondary miss to the same in-flight line allocates a subentry
• Each subentry records: register in processor, block offset, type (rd/wr), data (or pointer)
• References: Kroft, ISCA'81; Scheurich & Dubois, SC'88; Farkas & Jouppi, ISCA'94
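This entry/subentry bookkeeping can be sketched in a few lines of Python. This is an illustrative model only, not the hardware in the talk: the class name `MSHRFile`, the sizes, and the return codes are all invented here.

```python
# Toy model of an MSHR file: a primary miss allocates an entry keyed by
# cache-line address; a secondary miss to the same in-flight line adds a
# subentry; running out of either forces a lockup (the cache must stall).

class MSHRFile:
    def __init__(self, num_entries=8, subentries_per_entry=4, line_size=64):
        self.num_entries = num_entries
        self.subentries_per_entry = subentries_per_entry
        self.line_size = line_size
        self.entries = {}  # line address -> list of subentries

    def access(self, addr, is_write, dest_reg):
        """Record a miss. Returns 'primary', 'secondary', or 'lockup'."""
        line = addr - (addr % self.line_size)
        sub = {"reg": dest_reg,                   # register in processor
               "offset": addr % self.line_size,   # block offset
               "type": "wr" if is_write else "rd"}
        if line in self.entries:                  # line already in flight
            subs = self.entries[line]
            if len(subs) >= self.subentries_per_entry:
                return "lockup"                   # no free subentry
            subs.append(sub)
            return "secondary"
        if len(self.entries) >= self.num_entries:
            return "lockup"                       # no free entry
        self.entries[line] = [sub]
        return "primary"

    def fill(self, line):
        """Line arrived from memory: release the entry, return its subentries."""
        return self.entries.pop(line, [])
```

Running out of either entries or subentries produces a lockup, which is exactly the failure mode the rest of the talk is concerned with.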

Page 4:


Background on MHA

• Kroft [ISCA'81] proposed the first non-blocking cache
  - Introduced the MSHR file

[Diagram: Unified MHA (a single MSHR file serving the whole L1 cache) vs. Banked MHA (one MSHR file per L1 bank)]

• Sohi and Franklin [ISCA'91]
  - Evaluated cache bandwidth
  - MSHR file banked with the cache

Page 5:


Motivation

• MHAs must support many more misses
  - Brute force approaches will not do

[Diagram: Unified MHA vs. Banked MHA organizations, as on the previous slide]

• Centralized design has low bandwidth
• Banking may cause access imbalance (and lockup) or inefficient area usage
  - Imbalance-induced processor stall
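The imbalance problem can be illustrated with a toy Python model. Everything here is hypothetical (the functions `bank_of` and `simulate`, the bank count, and the file sizes are not from the talk): misses are steered to per-bank MSHR files by address bits, so an unlucky stride fills one bank's file while the others sit idle.

```python
# Why banking can cause imbalance-induced lockup: a stride that maps
# every miss to the same bank exhausts that bank's MSHR file even
# though the other banks' files are empty.

def bank_of(line_addr, num_banks=4, line_size=64):
    # Bank selected by low-order line-address bits.
    return (line_addr // line_size) % num_banks

def simulate(miss_addrs, num_banks=4, entries_per_bank=4):
    """Count misses absorbed before some bank's MSHR file fills (lockup)."""
    occupancy = [0] * num_banks
    for handled, addr in enumerate(miss_addrs):
        b = bank_of(addr, num_banks)
        if occupancy[b] == entries_per_bank:
            return handled          # lockup: this bank's file is full
        occupancy[b] += 1
    return len(miss_addrs)
```

A stream of consecutive lines spreads across all banks and uses every entry, while a stride of `num_banks` lines lands every miss in one bank and locks up after only `entries_per_bank` misses, leaving three quarters of the MSHR area unused.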

Page 6:


Proposal: Hierarchical MHA

• A small per-bank MSHR file with a Bloom filter
  - High bandwidth

• A larger, Shared MSHR file
  - High effective capacity
  - Low lock-up time

[Diagram: each L1 bank has its own MSHR file and Bloom filter; all banks share a larger Shared MSHR file within the MHA]

Page 7:


Contributions

• Show that state-of-the-art designs are a significant bottleneck

• Propose a Hierarchical MHA to meet high-MLP demands

• Thoroughly evaluate on Checkpointed processors with SMT and show:
  - Over state-of-the-art, average speedups of 32% to 95%
  - Over a large Unified design, average speedups of 1% to 18%
  - Performance close to an unlimited-size MHA

Page 8:


Why not reuse load/store queue state?

• High MLP: need state in the LSQ and in the MHA
  - Could simplify the MHA by leveraging the complex LSQ:
    allocate an MSHR on a primary miss, keep all secondary miss state in the LSQ

• Disadvantages of leveraging the LSQ
  - Induces additional global searches in the LSQ from the cache side
    (searches would use MSHR ID or line address, not word address)
  - Some checkpointed microarchitectures speculatively retire instructions and discard LSQ state
  - The LSQ is timing critical: better not to put restrictions on it

• We keep primary and secondary miss info in the MHA and rely on no specific LSQ design

Page 9:


Outline

• Requirements of new MHAs
• Hierarchical MHA
• Experimental setup and evaluation

Page 10:


Requirements for the new MHAs

• High capacity

[Chart comparing Conventional and Checkpointed processors]

Page 11:


Requirements for the new MHAs

• High capacity
• High bandwidth
  - Average increase of 30%

Page 12:


Requirements for the new MHAs

• High capacity
• High bandwidth
  - Average increase of 30%
• Banked MHAs may suffer from access-imbalance lockups
  - From 15% to 23% slowdown
• Need many entries and subentries
  - 32 entries (primary misses)
  - 16 to 32 subentries (secondary misses)

These are our design goals

Page 13:


Outline

• Requirements of new MHAs
• Hierarchical MHA
• Experimental setup and evaluation

Page 14:


Hierarchical MHA

[Diagram: each L1 bank has a Dedicated MSHR file and a Bloom filter; a Shared MSHR file backs all banks within the MHA]

Operation:
• Allocate in the Dedicated file
• When the Dedicated file is full, displace an entry to the Shared file and record its line in the Bloom filter
• The Bloom filter averts Shared-file accesses
• Secondary misses will often hit in the Dedicated file
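The allocation and displacement flow can be sketched as follows. This is a simplified Python model under stated assumptions, not the exact hardware: `HierarchicalBank`, `BloomFilter`, and all sizes are illustrative.

```python
# Hierarchical MHA allocation path for one L1 bank: try the small
# Dedicated file first; consult the Bloom filter before touching the
# Shared file; on a full Dedicated file, displace a victim entry to the
# Shared file and record it in the Bloom filter.

class BloomFilter:
    def __init__(self, bits=64):
        self.bits = [False] * bits

    def _hashes(self, x):
        # Two simple hash functions over the line address (illustrative).
        return (hash(x) % len(self.bits),
                hash(x * 2654435761) % len(self.bits))

    def insert(self, x):
        for h in self._hashes(x):
            self.bits[h] = True

    def may_contain(self, x):
        return all(self.bits[h] for h in self._hashes(x))

class HierarchicalBank:
    def __init__(self, dedicated_entries=4, shared_file=None, bloom=None):
        self.dedicated = {}                # line -> subentry list
        self.capacity = dedicated_entries
        self.shared = shared_file          # dict shared across banks
        self.bloom = bloom                 # this bank's Bloom filter

    def miss(self, line, subentry):
        # Secondary misses usually hit in the small Dedicated file.
        if line in self.dedicated:
            self.dedicated[line].append(subentry)
            return "dedicated"
        # The Bloom filter averts most Shared-file accesses: probe the
        # Shared file only if the filter says the line may be there.
        if self.bloom.may_contain(line) and line in self.shared:
            self.shared[line].append(subentry)
            return "shared"
        # Primary miss: allocate in the Dedicated file, displacing an
        # existing entry to the Shared file if the Dedicated file is full.
        if len(self.dedicated) == self.capacity:
            victim, subs = self.dedicated.popitem()
            self.shared[victim] = subs
            self.bloom.insert(victim)
        self.dedicated[line] = [subentry]
        return "dedicated"
```

Because the same hash functions are used for `insert` and `may_contain`, a displaced line always tests positive, so a later secondary miss to it is never lost; false positives only cost a wasted Shared-file probe.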

Page 15:


Hierarchical meets design goals

• Infrequent L1 lock-up while using MHA area efficiently
  - Use the Shared file for displacements

• High bandwidth
  - Per-bank Dedicated file; allocation happens in the Dedicated file
  - Locality ensures most accesses hit in the Dedicated file

• Bloom filter for the Shared file
  - Averts most useless accesses to the Shared file
  - Prevents a bottleneck at the Shared file

[Diagram: Hierarchical MHA organization, as on the previous slides]

Page 16:


Overall organization and timing

• Dedicated file
  - Small and fully pipelined
  - Few entries and subentries

• Per-bank Bloom filter
  - Accessed in parallel with the Dedicated file
  - No false negatives

• Shared file
  - Highly associative and unpipelined
  - Contains many entries and subentries

[Diagram: Hierarchical MHA organization, as on the previous slides]
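The "no false negatives" property is what makes the filter safe to consult before the Shared file: a negative answer can skip the access without any risk of losing a displaced entry. A minimal sketch follows; `SmallBloom`, its size, and its hash choices are invented for illustration.

```python
# Minimal Bloom filter over cache-line addresses: every inserted key
# always tests positive (no false negatives), while an absent key may
# occasionally test positive (a harmless false positive).

class SmallBloom:
    def __init__(self, nbits=32):
        self.word = 0          # bit vector packed into one integer
        self.nbits = nbits

    def _bit_positions(self, key):
        # Two cheap hash functions over the line address.
        return (key % self.nbits, (key // self.nbits + key) % self.nbits)

    def insert(self, key):
        for p in self._bit_positions(key):
            self.word |= 1 << p

    def may_contain(self, key):
        # Positive only if every bit for this key is set; since insert
        # sets exactly those bits, an inserted key can never test negative.
        return all((self.word >> p) & 1 for p in self._bit_positions(key))
```

False positives cost only an unnecessary Shared-file probe; false negatives would drop a miss, which is why the guarantee matters.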

Page 17:


Outline

• Requirements of new MHAs
• Hierarchical MHA
• Experimental setup and evaluation

Page 18:


Experimental setup

• 5 GHz processor
  - 5-issue, SMT with 2 contexts
  - Configurations: Conventional, Checkpointed, LargeWindow (2K-entry ROB)

• 32 KB L1 data cache
  - 8 banks, 2-way, 64 B lines, 3-cycle access, 1 port

• Memory bus bandwidth: 15 GB/s

• Workloads: CINT, CFP, Mix

• SESC simulator (sesc.sourceforge.net)

Page 19:


Compare MHAs with the same area

• 8%, 15%, and 25% of L1 cache area
  - Area estimated using CACTI 4.1
  - MSHR structures are fully associative

• Unified, Banked, and Hierarchical evaluated at each area point
• Current: supports 8 outstanding misses, like the Pentium 4

[Diagram: MHA sized at 8%, 15%, and 25% of the L1 cache area]

Page 20:


Performance at 15% area for Checkpointed

• Current is much worse
• Hierarchical is better than Unified and Banked
  - 1% to 18% over Unified
  - 10% to 21% over Banked
• Hierarchical is very close to Unlimited

Page 21:


Performance at 15% area for other processors

• Conventional
  - Less gain across the board

• LargeWindow
  - Current bottlenecks the processor
  - Hierarchical outperforms the rest

• Other architectures can leverage this design

[Charts: results for Conventional and LargeWindow]

Page 22:


Performance at different area points

• Checkpointed running Mixes
• Unified saturates at 15%
• Banked continues to improve as it scales up
• Hierarchical is the most efficient at these areas

[Chart: speedup over Banked-15% at each area point]

Page 23:


Characterization

• The Bloom filter averts the majority of Shared-file accesses
  - On average, from 89% to 95%

• Most secondary misses hit in the Dedicated file
• Reasons for displacing an entry from the Dedicated file:
  - No free subentries: 18% to 40%
  - No free entries: 60% to 82%

Page 24:


Conclusions

• State-of-the-art MHA designs are a large bottleneck
  - Hierarchical speeds up 32% to 95% over state-of-the-art

• Brute-force Unified and Banked designs are suboptimal
  - Hierarchical speeds up 1% to 18% over Unified
  - Hierarchical speeds up 10% to 21% over Banked

• Hierarchical performs best over a range of areas
• The additional complexity of Hierarchical is reasonable

Page 25:


Questions?

James Tuck, Luis Ceze, and Josep Torrellas
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu

Scalable Cache Miss Handling For High MLP