Korea University, VLSI Signal Processing Lab. Jinil Chung (jinil_chung@korea.ac.kr)




Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era
Dimitris Kaseridis+, Jeffrey Stuecheli*+, and Lizy Kurian John+
MICRO'11
Korea University, VLSI Signal Processing Lab.
Jinil Chung (jinil_chung@korea.ac.kr)

+ The University of Texas at Austin
* IBM Corp.
[Paper Review]

Good morning, everyone. I'm Jinil Chung from the VLSI Signal Processing Lab.

Abstract

-. [IEEE Spectrum (link)] DRAM: a balance between performance, power, and storage density
-. To realize good performance, one must manage the structural and timing restrictions of the DRAM devices
-. Use of the page-mode feature can mitigate many DRAM constraints
-. Aggressive page-mode results in many conflicts (e.g. bank conflicts) when multiple workloads in many-core systems map to the same DRAM
-. In this paper, the Minimalist approach: just enough page-mode accesses to get the benefits, while avoiding unfairness
-. Proposed: address hashing + data prefetch engine + per-request priority

DRAM is used mainly as the main memory.

1. Introduction

Row buffer (or page-mode) access
-. This paper proposes a combination of the open-page and closed-page policies, based on the observation that the page-mode gain comes with only a small number of page accesses
-. Proposes a fair DRAM address mapping scheme: low RBL & high BLP
-. Page-mode hits come from spatial locality, which can be captured by prefetch engines (NOT temporal locality!)
-. Proposes an intuitive criticality-based memory request priority scheme

Page-mode gain:
  - Open-page policy: reduces row access latency
  - Closed-page policy: none (single column access per row activation)
Multiple requests in many-core systems:
  - Open-page policy: introduces priority inversion and fairness/starvation problems
  - Closed-page policy: avoids the complexities of row buffer management
(RBL: Row-buffer Locality, BLP: Bank-level Parallelism)

While DRAM devices output only 16-64 bits per request (depending on the DRAM type and burst settings), internally the devices operate on much larger pages, typically 1KB (also referred to as rows). As shown in the figure, each DRAM array access causes all 1KB of a page to be read into an internal array called the row buffer, followed by a column access to the requested sub-block of data. Since the read latency and power overhead of the DRAM cell array access have already been paid, accessing multiple columns of that page decreases both the latency and power of subsequent accesses. These successive accesses are said to be performed in page-mode, and the memory requests that are serviced by an already opened page loaded in the row buffer are characterized as page hits.

2. Background

-. DRAM timing constraints result in dead time before and after random accesses
-. The MC's (Memory Controller's) job is to reduce these performance-limiting gaps using parallelism

1) tRC (row cycle time; ACT-to-ACT @same bank): after the MC activates a page, it must wait for tRC before the next activation in the same bank. When multiple threads access different rows in the same bank, each switch pays this latency overhead (tRC delay).

2) tRP (row precharge time; PRE-to-ACT @same bank): under the open-page policy, when the MC activates another page in the same bank it pays the tRP penalty (= the current page must be closed before the new page is opened).

[Figure: ACT, PRE, ACT command timeline on the same bank; tRAS (e.g. 36ns) followed by tRP (e.g. 12ns) makes up tRC (e.g. 48ns)]
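To make these constraints concrete, here is a minimal sketch (not from the paper) of how a controller might track the per-bank timing rules above. The struct and method names are illustrative assumptions; times are in nanoseconds and the constants are the slide's example DDR3 values.

    #include <algorithm>
    #include <cstdint>

    // Per-bank timing tracker (illustrative). All times in nanoseconds.
    struct BankTimingState {
        static constexpr uint64_t tRC  = 48;  // ACT-to-ACT, same bank
        static constexpr uint64_t tRAS = 36;  // ACT-to-PRE, same bank
        static constexpr uint64_t tRP  = 12;  // PRE-to-ACT, same bank

        uint64_t lastAct = 0;  // time of the last ACT to this bank
        uint64_t lastPre = 0;  // time of the last PRE to this bank

        // Earliest time a new ACT (row activation) may issue to this bank.
        uint64_t nextActReady() const {
            return std::max(lastAct + tRC, lastPre + tRP);
        }
        // Earliest time a PRE (precharge) may issue to this bank.
        uint64_t nextPreReady() const { return lastAct + tRAS; }

        void issueAct(uint64_t now) { lastAct = now; }
        void issuePre(uint64_t now) { lastPre = now; }
    };

Random accesses that land on different rows of the same bank serialize on nextActReady(), which is exactly the dead time the scheduler tries to hide with bank-level parallelism.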

3. Motivation

Use of page-mode:
-. Latency effects: due to tRC & tRP, the latency benefit of page-mode saturates after a small # of accesses per activation
-. Power reduction: only the activate power is reduced, so a small # of accesses is enough
-. Bank utilization: drops off quickly as accesses per activation increase (e.g. roughly from 62% down to 16% in the paper's figure), so a small # of accesses is enough; if bank utilization is high, the probability that a new request conflicts with a busy bank is greater
-. Other DRAM complexities: a small # of accesses is enough to soften restrictions; see the sketch below
   ex) tFAW (Four Activate Window; 30ns), cache block transfer delay = 6ns
   -. single access per ACT: peak utilization is limited (6ns * 4 / 30ns = 80%)
   -. two or more accesses per ACT: peak utilization is not limited (12ns * 4 / 30ns > 100%)
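The tFAW arithmetic above can be checked with a few lines. This is an illustrative calculation only; the 6ns cache-block transfer time and the 30ns window come from the slide, and the function name is an assumption.

    #include <cstdio>

    // Peak data-bus utilization under tFAW: at most 4 ACTs per rolling
    // 30ns window, each ACT yielding accessesPerAct cache-block bursts
    // of 6ns each.
    double peakUtilization(int accessesPerAct,
                           double tFAW = 30.0,       // ns, four-activate window
                           double blockXfer = 6.0) { // ns per cache-block burst
        double busyTime = 4.0 * accessesPerAct * blockXfer;
        double util = busyTime / tFAW;
        return util > 1.0 ? 1.0 : util;  // cap at 100%
    }

    int main() {
        std::printf("1 access/ACT:   %.0f%%\n", 100 * peakUtilization(1)); // 80%
        std::printf("2 accesses/ACT: %.0f%%\n", 100 * peakUtilization(2)); // 100%
    }

With two or more column accesses per activation, the data bus is already saturated before tFAW becomes the limiter, which is why a small number of page-mode accesses suffices.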

3.1 Row-buffer Locality in Modern Processors
: in current WS/server-class designs with a large last-level cache (e.g. IBM POWER7)

-. RBL: Row-buffer Locality
-. Temporal locality: becomes hits to the large last-level cache
-. Row buffers exploit only spatial locality
-. Using prefetch engines, spatial locality can be predicted

In this section we describe our observations regarding page-mode accesses as seen in current workstation/server-class designs. Contemporary CMP processor designs have evolved into impressive systems on a chip. Many high-performance processors (eight cores in current leading-edge designs) are backed by large last-level caches containing up to 32MB of capacity [POWER7]. A typical memory hierarchy that includes the DRAM row buffer is shown in the figure. As a large last-level cache filters out requests to the memory, row buffers inherently exploit only spatial locality; an application's temporal locality results in hits to the much larger last-level cache. Access patterns with high levels of spatial locality, which miss in the large last-level cache, are often very predictable. In general, speculative execution and prefetch algorithms can be exploited to generate memory requests with spatial locality in dense access sequences. Consequently, the latency benefit of page-mode is diminished.

3.2 Bank and Row Buffer Locality Interplay with Address Mapping

-. DRAM device address: row, column, and bank
-. Example: workload A issues a long sequential access sequence; workload B issues a single operation
   -. Workload A gets higher priority, so B0 is slowed (e.g. FR-FCFS; all DRAM column bits in the low-order real address)
   -. Workload B gets higher priority, so A4 is slowed (e.g. ATLAS, PAR-BS; all DRAM column bits in the low-order real address)
   -. High BLP (Bank-level Parallelism): B0 can be serviced without degrading traffic to workload A (e.g. Minimalist; DRAM column & bank bits in the low-order real address)

The mapping of the real memory address into the DRAM device address (row, column, bank) makes a very significant contribution to memory system behavior. Mapping the spatial locality of request streams to memory resources is the dominant concern.

4. Minimalist Open-page Mode

4.1 DRAM Address Mapping Scheme

[Figure: DRAM address mapping bit fields; 7-bit, 5-bit, and 2-bit fields shown]

: For sequential access of 4 cache lines
-. The basic difference is that the column access bits are split across two places:
   +. the 2 LSB bits are located right after the block bits
   +. the 5 MSB bits are located just before the row bits

-. (Not shown in the figure) Higher-order address bits are XOR-ed with the bank bits to produce the actual bank selection bits, reducing row buffer conflicts [Zhang et al., MICRO'00]

The above combination of bit selection allows workloads, especially streaming ones, to distribute their accesses across multiple DRAM banks, improving bank-level parallelism and avoiding over-utilization of a small number of banks, which leads to thread starvation and priority inversion in multi-core environments.
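As a rough illustration of the mapping, the sketch below decodes a real address with the column bits split as described and XOR bank hashing applied. Only the 5-bit/2-bit column split comes from the slide; the 128B block size, the 3-bit bank field, and all other widths are assumptions, not the paper's exact layout.

    #include <cstdint>

    // Decoded DRAM coordinates (illustrative field widths).
    struct DramAddr {
        uint64_t row;
        unsigned colMsb, bank, colLsb, block;
    };

    DramAddr decode(uint64_t realAddr) {
        DramAddr a;
        a.block  = static_cast<unsigned>( realAddr        & 0x7F); // 7-bit block offset (assumed 128B line)
        a.colLsb = static_cast<unsigned>((realAddr >> 7)  & 0x3);  // 2 column LSBs: 4 sequential
                                                                   // lines share one row buffer
        a.bank   = static_cast<unsigned>((realAddr >> 9)  & 0x7);  // bank field (assumed 8 banks)
        a.colMsb = static_cast<unsigned>((realAddr >> 12) & 0x1F); // 5 column MSBs, just below the row
        a.row    = realAddr >> 17;                                 // remaining bits select the row
        // XOR bank hashing: fold higher-order (row) bits into the bank
        // selection to spread conflicting rows across banks [Zhang et al.].
        a.bank  ^= static_cast<unsigned>(a.row & 0x7);
        return a;
    }

Note how a sequential stream keeps 4 consecutive cache lines in one row buffer (via colLsb) and then moves to a different bank, which is exactly the low-RBL / high-BLP behavior the scheme targets.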

4.2 Data Prefetch Engine [IBM POWER6]
: predictable page-mode opportunities imply the need for an accurate prefetch engine
: each core includes a HW prefetcher with a prefetch depth/distance predictor

1) Multi-line Prefetch Requests
-. A multi-line prefetch operation is a single request (indicating a specific sequence of cache lines)
-. Reduces command bandwidth and queue resources
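A hypothetical encoding of such a request might look like the following; the struct, its fields, and the 128B line size are illustrative assumptions, not the paper's format.

    #include <cstdint>
    #include <vector>

    // One queue entry describes a run of sequential cache lines instead of
    // N separate entries, saving command bandwidth and request-queue slots.
    struct MultiLinePrefetch {
        uint64_t startAddr;  // first cache line of the run
        unsigned numLines;   // length of the sequential run
        unsigned distance;   // prefetch depth/distance hint from the
                             // predictor (later used for prioritization)
    };

    // The memory controller expands the single request into column accesses
    // against one open row, paying a single ACT for the whole run.
    std::vector<uint64_t> expand(const MultiLinePrefetch& req,
                                 unsigned lineBytes = 128) { // assumed line size
        std::vector<uint64_t> lines;
        for (unsigned i = 0; i < req.numLines; ++i)
            lines.push_back(req.startAddr + static_cast<uint64_t>(i) * lineBytes);
        return lines;
    }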

4.3 Memory Request Queue Scheduling Scheme

: In OOO execution, the importance of each request can vary both between and within applications, hence the need for a dynamic priority scheme

1) DRAM Memory Request Priority Calculation (see the sketch after this list)
-. Different priorities based on criticality to performance
-. The priority of each request is increased every 100ns time interval (time-based aging)
-. 2 categories: read (normal) and prefetch; read requests get higher priority
-. MLP information from the MSHR in each core: many outstanding misses means each individual miss is less important
-. Distance information from the prefetch engine (4.2)
(MLP: Memory-Level Parallelism, MSHR: Miss Status Holding Register)
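Below is a minimal sketch of how these inputs could combine into a single priority value. The specific weights, thresholds, and the 3-bit priority field are assumptions for illustration; the paper defines its own rule table, so treat this only as the flavor of the scheme.

    #include <algorithm>
    #include <cstdint>

    struct Request {
        bool     isPrefetch;     // prefetch reads start below demand reads
        unsigned mshrMisses;     // outstanding misses in the core's MSHR (MLP)
        unsigned prefetchDist;   // distance hint from the prefetch engine
        uint64_t enqueueTimeNs;  // when the request entered the queue
    };

    unsigned priority(const Request& r, uint64_t nowNs) {
        unsigned p = r.isPrefetch ? 0u : 4u;       // demand reads above prefetches
        if (r.mshrMisses <= 1) p += 2;             // low MLP: this miss is critical
        if (r.isPrefetch && r.prefetchDist <= 2)   // short distance: needed soon
            p += 1;
        // Time-based aging: bump priority every 100ns to prevent starvation.
        p += static_cast<unsigned>((nowNs - r.enqueueTimeNs) / 100);
        return std::min(p, 7u);                    // assumed 3-bit priority field
    }

The aging term is what keeps the scheme fair: even a deprioritized prefetch from a high-MLP core eventually climbs past newly arrived demand reads.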

4.3 Memory Request Queue Scheduling Scheme (cont.)

2) DRAM Page Closure (Precharge) Policy
-. Uses auto-precharge, increasing the available command bandwidth (no separate PRE command on the command bus)

3) Overall Memory Request Scheduling Scheme (Priority Rules 1)
-. The same rules are used by all MCs, so no communication among MCs is needed
-. If an MC is servicing the multiple transfers of a multi-line prefetch request, it can be interrupted by a higher-priority request, so a very critical request can be serviced with the smallest latency

4) Handling Write Operations
-. The dynamic priority scheme does not apply to writes
-. Uses the VWQ (Virtual Write Queue), causing minimal write-induced interference

The rules in Priority Rules 1 summarize the per-request scheduling prioritization scheme that is used in the Minimalist Open-page scheme.
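Given per-request priorities, the arbitration step itself can be sketched as below, assuming arrival order breaks ties within a priority level (an assumption consistent with, but not quoted from, the paper). Because selection reruns every scheduling cycle, an in-progress multi-line prefetch is naturally preempted by a higher-priority arrival.

    #include <cstdint>
    #include <vector>

    struct QueuedRequest {
        unsigned priority;       // from the per-request calculation above
        uint64_t enqueueTimeNs;  // FCFS tiebreak within a priority level
    };

    // Pick the pending request with the highest priority; among equals,
    // the oldest wins. Returns nullptr if the queue is empty.
    const QueuedRequest* selectNext(const std::vector<QueuedRequest>& q) {
        const QueuedRequest* best = nullptr;
        for (const QueuedRequest& r : q)
            if (!best || r.priority > best->priority ||
                (r.priority == best->priority &&
                 r.enqueueTimeNs < best->enqueueTimeNs))
                best = &r;
        return best;
    }

Because every MC applies the identical, priority-only rule, no cross-controller coordination is required, matching the slide's point above.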

5. Evaluation

-. 8-core CMP system using the Simics functional model extended with the GEMS toolset
-. Simulates DDR3 1333MHz DRAM, using the memory controller policy under test for each experiment
-. The Minimalist open-page scheme is compared against three open-page policies (Table 5):
   1) PAR-BS (Parallelism-Aware Batch Scheduler)
   2) ATLAS (Adaptive per-Thread Least-Attained-Service) memory scheduler
   3) FR-FCFS (First-Ready, First-Come-First-Served): the baseline

8-core CMP system using the Simics functional model [13] extended with the GEMS toolset [15]; the OOO processor model from GEMS is used along with an in-house detailed memory subsystem model and the GEMS hardware prefetching engine (Section 4.2). The memory controller model simulates DDR3 1333MHz DRAM using the memory controller policy under test for each experiment. Table 2 lists the full-system simulation parameters, Table 3 the DDR3 parameters modeled in our toolset, and Table 4 the selected workload set with their peak bandwidth use. Due to DDR DRAM timing constraints, achievable utilization is about 70%. 100M instructions are used to warm up the caches and MC, followed by 100M instructions of simulation.