Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures
Rajeev Balasubramonian
School of Computing, University of Utah
July 1st 2004
Billion-Transistor Chips
• Partitioned architectures: small computational units connected by a communication fabric
Small computational units with limited functionality → fast clocks, low design effort, low power
Numerous computational units → high parallelism
The Communication Bottleneck
• Wire delays do not scale down at the same rate as logic delays [Agarwal, ISCA’00][Ho, Proc. IEEE’01]
30-cycle delay to go across the chip in 10 years
1-cycle inter-hop latency in the RAW prototype [Taylor, ISCA’04]
Cache Design
[Figure: centralized L1D cache. Address transfer: 6 cycles; RAM access: 6 cycles; data transfer: 6 cycles.]
Centralized cache: 18-cycle access (12 cycles for communication)
Cache Design
[Figure: the centralized cache above contrasted with a decentralized alternative in which L1D banks are replicated across the clusters.]
Research Goals
• Identify bottlenecks in cache access
• Design cluster prefetch, a latency hiding mechanism
• Evaluate and compare centralized and decentralized designs
Outline
• Motivation
• Evaluation platform
• Cluster prefetch
• Centralized vs. decentralized caches
• Conclusions
Clustered Microarchitectures
• Centralized front-end
• Dynamically steered instructions (based on data dependences and load balance)
• Out-of-order issue and 1-cycle bypass within a cluster
• Hierarchical interconnect
[Figure: the baseline clustered microarchitecture: centralized instruction fetch, clusters connected by crossbars, crossbars linked by a ring, and a centralized L1D cache with its LSQ.]
Simulation Parameters
• SimpleScalar-based simulator
• In-flight instruction window of 480
• 16 clusters, each with 60 registers, 30 issue queue entries, and one FU of each kind
• Inter-cluster latencies between 2 and 10 cycles
• Primary focus on SPEC-FP programs
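For concreteness, the simulated processor can be summarized as a configuration record; this is an illustrative summary only, and the field names are hypothetical rather than actual SimpleScalar knobs:

```python
# Illustrative summary of the simulated processor (hypothetical names;
# these are not actual SimpleScalar configuration flags).
SIM_CONFIG = {
    "window_size": 480,               # in-flight instruction window
    "num_clusters": 16,
    "regs_per_cluster": 60,
    "iq_entries_per_cluster": 30,
    "fus_per_cluster": 1,             # one functional unit of each kind
    "intercluster_latency": (2, 10),  # min/max cycles between clusters
    "benchmarks": "SPEC-FP",
}
```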
Steps Involved in Cache Access
[Figure: path of a load from instruction fetch through the centralized L1D/LSQ.]
1. Instruction dispatch
2. Effective address computation
3. Effective address transfer
4. Memory disambiguation
5. RAM access
6. Data transfer
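To make the cost concrete, a load's end-to-end latency is simply the sum of these sequential stages; a minimal sketch, using the per-stage averages from the breakdown on the next slide (component averages are rounded, so the sum is approximate):

```python
# Sketch: a load's end-to-end latency modeled as a sum of sequential
# stage latencies. Cycle counts are the per-stage averages reported
# in the following slide's breakdown.
LOAD_STAGES = [
    ("transfer of instr to cluster",       2),
    ("effective address computation",     25),
    ("address transfer to LSQ",            7),
    ("memory dependence resolution",      34),
    ("cache (RAM) access",                26),
    ("data transfer from LSQ to cluster",  5),
]

def load_latency(stages=LOAD_STAGES):
    """Average cycles from dispatch until data reaches the cluster."""
    return sum(cycles for _, cycles in stages)

print(load_latency())  # ~98 cycles per load on average
```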
Lifetime of a Load
[Figure: average cycles per load, broken down by stage: transfer of instr to cluster (2), eff. addr. computation (25), addr. transfer to LSQ (7), memory dependence resolution (34), cache access (26), data transfer from LSQ to cluster (5); total: 98.]
Load Address Prediction
[Figure: baseline timeline for a load: dispatch at cycle 0, effective address transfer from cluster to LSQ at cycle 27, cache access at cycle 68, data transfer back to the cluster at cycle 94.]
[Figure: the same timeline with an address predictor: the predicted address lets the cache access begin at cycle 0 (at dispatch) rather than cycle 68, and the data transfer reaches the cluster at cycle 26 instead of cycle 94.]
Memory Dependence Speculation
• To allow early cache access, loads must issue before resolving earlier store addresses
• High-confidence store address predictions are employed for disambiguation
• Stores that have never forwarded results within the LSQ are ignored
Cluster Prefetch: Combination of Load Address Prediction and Memory Dependence Speculation
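A minimal sketch of the issue test this implies, under the assumption that it can be phrased as a per-load check (all names here are hypothetical, not the paper's hardware structures):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StorePred:
    """A store-address prediction (hypothetical encoding)."""
    addr: int
    high_confidence: bool

def may_issue_early(load_addr, older_store_pcs, never_forwarded, predict):
    """Decide whether a load may access the cache before all earlier
    store addresses resolve (illustration only).

    older_store_pcs: PCs of earlier in-flight stores with unresolved addresses
    never_forwarded: set of store PCs that have never forwarded within the LSQ
    predict:         pc -> Optional[StorePred], a store-address predictor
    """
    for pc in older_store_pcs:
        if pc in never_forwarded:
            continue        # store has never forwarded a result: ignore it
        pred: Optional[StorePred] = predict(pc)
        if pred and pred.high_confidence and pred.addr != load_addr:
            continue        # confident prediction says no conflict
        return False        # possible conflict: wait for resolution
    return True
```

A load that passes the check accesses the cache speculatively; as the next slide notes, a mispredict is handled by flushing all subsequent instructions.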
Implementation Details
• Centralized table that maintains a stride and last address per load PC; a stride is established by five consecutive confirming accesses and cleared after five mispredicts (see the sketch after this list)
• Separate centralized table that maintains a single bit per entry to indicate stores that pose conflicts
• Each mispredict flushes all subsequent instrs
• Storage overhead: 18KB
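The stride table above might look as follows in software; this is one plausible reading of the slide's description, with hypothetical names, and the thresholds taken from the five-access/five-mispredict rule:

```python
# Sketch of the per-PC stride address predictor (hypothetical names).
# Each entry keeps the last address and a candidate stride; the stride
# is trusted only after CONFIRMS consecutive confirming accesses, and a
# trusted stride is cleared after MISSES consecutive mispredicts.
CONFIRMS, MISSES = 5, 5

class StridePredictor:
    def __init__(self):
        self.table = {}  # pc -> [last_addr, stride, confirms, misses]

    def predict(self, pc):
        """Predicted effective address, or None if not yet confident."""
        e = self.table.get(pc)
        if e and e[2] >= CONFIRMS:
            return e[0] + e[1]
        return None

    def update(self, pc, addr):
        """Train on the actual effective address once it is computed."""
        e = self.table.get(pc)
        if e is None:
            self.table[pc] = [addr, 0, 0, 0]
            return
        last, stride, confirms, misses = e
        if addr - last == stride:
            confirms, misses = confirms + 1, 0
        else:
            misses += 1
            if confirms < CONFIRMS or misses >= MISSES:
                # not yet trusted, or too many misses: adopt new candidate
                stride, confirms, misses = addr - last, 0, 0
        self.table[pc] = [addr, stride, confirms, misses]
```

On a confident prediction, the address can be shipped to the LSQ at dispatch so cache access and disambiguation begin immediately; the computed effective address later verifies the prediction, and a mispredict flushes all subsequent instructions, as noted above.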
Performance Results
[Figure: IPC for the SPEC-FP programs applu, apsi, art, equake, fma3d, galgel, lucas, mesa, mgrid, swim, and wupwise, plus the harmonic mean (HM), under four configurations: base case; ld-addr pred only; st-addr and mem-dep pred only; and ld-addr, st-addr, and mem-dep pred.]
Overall IPC improvement: 21%
Results Analysis
• Roughly half the programs improved IPC by >8%
• Load address prediction rate: 65%
• Store address prediction rate: 79%
• Stores likely to not pose conflicts: 59%
• Avg. number of mispredicts: 12K per 100M instrs
Decentralized Cache
Replicated cache banks:
• Loads do not travel far
• Stores & cache refills are broadcast
• Memory disambiguation is not accelerated
• Overheads: interconnect for broadcast and cache refill, power for redundant writes, distributed LRU, etc. (sketched after the figure below)
[Figure: decentralized cache: each group of clusters has its own replicated L1D bank and LSQ.]
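As a rough illustration of the replication policy (the structure below is hypothetical, not the paper's design): loads read only their local bank, while stores and refills write every replica, which is where the broadcast and redundant-write overheads come from:

```python
# Sketch of replicated L1D banks (hypothetical structure). Loads hit the
# bank local to the issuing cluster, so addresses and data never cross
# the chip; stores and cache refills are broadcast to every replica.
class ReplicatedL1D:
    def __init__(self, num_banks=4):
        self.banks = [dict() for _ in range(num_banks)]  # addr -> data

    def load(self, local_bank, addr):
        # Local access only: no cross-chip address/data transfer.
        return self.banks[local_bank].get(addr)

    def store(self, addr, data):
        # Broadcast: every replica performs the (redundant) write.
        for bank in self.banks:
            bank[addr] = data

    refill = store  # refills from the next level are broadcast the same way
```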
Comparing Centralized & Decentralized
[Figure: centralized L1D/LSQ vs. a decentralized design with four replicated L1D/LSQ banks.]
IPC without cluster prefetch: 1.43 (centralized) vs. 1.52 (decentralized)
IPC with cluster prefetch: 1.73 (centralized) vs. 1.79 (decentralized)
Sensitivity Analysis
• Results verified for processor models with varying resources and interconnect latencies
• Evaluations on SPEC-Int: the address prediction rate is only 38%, yielding modest speedups: twolf (7%), parser (9%), crafty/gcc/vpr (3-4%), the rest (< 2%)
Related Work
• Modest speedups with decentralized caches: Racunas and Patt [ICS ’03] for dynamic clustered processors; Gibert et al. [MICRO ’02] for VLIW clustered processors
• Gibert et al. [MICRO ’03]: compiler-managed L0 buffers for critical data
Conclusions
• Address prediction and memory dependence speculation can hide latency to cache banks; prediction rate of 66% for SPEC-FP and IPC improvement of 21%
• Additional benefits from decentralization are modest
• Future work: build better predictors, impact on power consumption [WCED ’04]