Korea University, VLSI Signal Processing Lab. Jinil Chung (jinil_chung@korea.ac.kr)




Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era
Dimitris Kaseridis+, Jeffrey Stuecheli*+, and Lizy Kurian John+
MICRO'11
Korea University, VLSI Signal Processing Lab.
Jinil Chung (jinil_chung@korea.ac.kr)

+ The University of Texas at Austin
* IBM Corp.
[Paper Review]

Good morning, everyone. I'm Jinil Chung from the VLSI Signal Processing Lab.

Abstract

-. [IEEE Spectrum (link)] DRAM: a balance between performance, power, and storage density
-. To realize good performance, one must manage the structural and timing restrictions of the DRAM devices
-. Use of the page-mode feature can mitigate many DRAM constraints
-. Aggressive page-mode results in many conflicts (e.g. bank conflicts) when multiple workloads in many-core systems map to the same DRAM
-. In this paper, the Minimalist approach: just enough page-mode accesses to get the benefits, while avoiding unfairness
-. Proposed: address hashing + data prefetch engine + per-request priority

DRAM is used mainly as the main memory.

1. Introduction

Row buffer (or page-mode) access
-. This paper proposes a combination of the open-page and closed-page policies, based on the observation that the page-mode gain comes with only a small number of page accesses
-. Proposes a fair DRAM address mapping scheme: low RBL & high BLP
-. Page-mode hits come from spatial locality, which can be captured by prefetch engines (NOT temporal locality!)
-. Proposes an intuitive criticality-based memory request priority scheme

Page-mode gain:
  - Open-page policy: reduces row access latency
  - Closed-page policy: none (single column access per row activation)
Multiple requests in many-core systems:
  - Open-page policy: introduces priority inversion and fairness/starvation problems
  - Closed-page policy: avoids the complexities of row buffer management
(RBL: Row-buffer Locality, BLP: Bank-level Parallelism)

While DRAM devices output only 16-64 bits per request (depending on the DRAM type and burst settings), internally the devices operate on much larger pages, typically 1KB (also referred to as rows). As shown in the figure, each DRAM array access causes all 1KB of a page to be read into an internal array called the row buffer, followed by a column access to the requested sub-block of data. Since the read latency and power overhead of the DRAM cell array access have already been paid, accessing multiple columns of that page decreases both the latency and power of subsequent accesses. These successive accesses are said to be performed in page-mode, and the memory requests that are serviced by an already opened page loaded in the row buffer are characterized as page hits.

2. Background

-. DRAM timing constraints result in dead time before and after random accesses
-. The MC's (Memory Controller's) job is to reduce these performance-limiting gaps using parallelism

1) tRC (row cycle time; ACT-to-ACT @same bank): after the MC activates a page, it must wait for tRC before the next activation in the same bank. When multiple threads access different rows in the same bank, each switch pays this latency overhead (tRC delay).

2) tRP (row precharge time; PRE-to-ACT @same bank): under the open-page policy, when the MC activates another page in the same bank it pays the tRP penalty (= the current page must be closed before the new page is opened).

[Figure: ACT, PRE, ACT command timeline on the same bank; tRAS (e.g. 36ns) followed by tRP (e.g. 12ns) makes up tRC (e.g. 48ns)]
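To make these constraints concrete, here is a minimal sketch (not from the paper) of how a controller might track the per-bank timing rules above. The struct and method names are illustrative assumptions; times are in nanoseconds and the constants are the slide's example DDR3 values.

    #include <algorithm>
    #include <cstdint>

    // Per-bank timing tracker (illustrative). All times in nanoseconds.
    struct BankTimingState {
        static constexpr uint64_t tRC  = 48;  // ACT-to-ACT, same bank
        static constexpr uint64_t tRAS = 36;  // ACT-to-PRE, same bank
        static constexpr uint64_t tRP  = 12;  // PRE-to-ACT, same bank

        uint64_t lastAct = 0;  // time of the last ACT to this bank
        uint64_t lastPre = 0;  // time of the last PRE to this bank

        // Earliest time a new ACT (row activation) may issue to this bank.
        uint64_t nextActReady() const {
            return std::max(lastAct + tRC, lastPre + tRP);
        }
        // Earliest time a PRE (precharge) may issue to this bank.
        uint64_t nextPreReady() const { return lastAct + tRAS; }

        void issueAct(uint64_t now) { lastAct = now; }
        void issuePre(uint64_t now) { lastPre = now; }
    };

Random accesses that land on different rows of the same bank serialize on nextActReady(), which is exactly the dead time the scheduler tries to hide with bank-level parallelism.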

3. Motivation

Use of page-mode:
-. Latency effects: due to tRC & tRP, the latency benefit of page-mode saturates after a small # of accesses per activation
-. Power reduction: only the activate power is reduced, so a small # of accesses is enough
-. Bank utilization: drops off quickly as accesses per activation increase (e.g. roughly from 62% down to 16% in the paper's figure), so a small # of accesses is enough; if bank utilization is high, the probability that a new request conflicts with a busy bank is greater
-. Other DRAM complexities: a small # of accesses is enough to soften restrictions; see the sketch below
   ex) tFAW (Four Activate Window; 30ns), cache block transfer delay = 6ns
   -. single access per ACT: peak utilization is limited (6ns * 4 / 30ns = 80%)
   -. two or more accesses per ACT: peak utilization is not limited (12ns * 4 / 30ns > 100%)
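The tFAW arithmetic above can be checked with a few lines. This is an illustrative calculation only; the 6ns cache-block transfer time and the 30ns window come from the slide, and the function name is an assumption.

    #include <cstdio>

    // Peak data-bus utilization under tFAW: at most 4 ACTs per rolling
    // 30ns window, each ACT yielding accessesPerAct cache-block bursts
    // of 6ns each.
    double peakUtilization(int accessesPerAct,
                           double tFAW = 30.0,       // ns, four-activate window
                           double blockXfer = 6.0) { // ns per cache-block burst
        double busyTime = 4.0 * accessesPerAct * blockXfer;
        double util = busyTime / tFAW;
        return util > 1.0 ? 1.0 : util;  // cap at 100%
    }

    int main() {
        std::printf("1 access/ACT:   %.0f%%\n", 100 * peakUtilization(1)); // 80%
        std::printf("2 accesses/ACT: %.0f%%\n", 100 * peakUtilization(2)); // 100%
    }

With two or more column accesses per activation, the data bus is already saturated before tFAW becomes the limiter, which is why a small number of page-mode accesses suffices.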

3.1 Row-buffer Locality in Modern Processors
: in current WS/server-class designs with a large last-level cache (e.g. IBM POWER7)

-. RBL: Row-buffer Locality
-. Temporal locality: becomes hits to the large last-level cache
-. Row buffers exploit only spatial locality
-. Using prefetch engines, spatial locality can be predicted

In this section we describe our observations regarding page-mode accesses as seen in current workstation/server-class designs. Contemporary CMP processor designs have evolved into impressive systems on a chip. Many high-performance processors (eight cores in current leading-edge designs) are backed by large last-level caches containing up to 32MB of capacity [POWER7]. A typical memory hierarchy that includes the DRAM row buffer is shown in the figure. As a large last-level cache filters out requests to the memory, row buffers inherently exploit only spatial locality; an application's temporal locality results in hits to the much larger last-level cache. Access patterns with high levels of spatial locality, which miss in the large last-level cache, are often very predictable. In general, speculative execution and prefetch algorithms can be exploited to generate memory requests with spatial locality in dense access sequences. Consequently, the latency benefit of page-mode is diminished.

3.2 Bank and Row Buffer Locality Interplay with Address Mapping

-. DRAM device address: row, column, and bank
-. Example: workload A issues a long sequential access sequence; workload B issues a single operation
   -. Workload A gets higher priority, so B0 is slowed (e.g. FR-FCFS; all DRAM column bits in the low-order real address)
   -. Workload B gets higher priority, so A4 is slowed (e.g. ATLAS, PAR-BS; all DRAM column bits in the low-order real address)
   -. High BLP (Bank-level Parallelism): B0 can be serviced without degrading traffic to workload A (e.g. Minimalist; DRAM column & bank bits in the low-order real address)

The mapping of the real memory address into the DRAM device address (row, column, bank) makes a very significant contribution to memory system behavior. Mapping the spatial locality of request streams to memory resources is the dominant concern.

4. Minimalist Open-page Mode

4.1 DRAM Address Mapping Scheme

[Figure: DRAM address mapping bit fields; 7-bit, 5-bit, and 2-bit fields shown]

: For sequential access of 4 cache lines
-. The basic difference is that the column access bits are split across two places:
   +. the 2 LSB bits are located right after the block bits
   +. the 5 MSB bits are located just before the row bits

-. (Not shown in the figure) Higher-order address bits are XOR-ed with the bank bits to produce the actual bank selection bits, reducing row buffer conflicts [Zhang et al., MICRO'00]

The above combination of bit selection allows workloads, especially streaming ones, to distribute their accesses across multiple DRAM banks, improving bank-level parallelism and avoiding over-utilization of a small number of banks, which leads to thread starvation and priority inversion in multi-core environments.
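As a rough illustration of the mapping, the sketch below decodes a real address with the column bits split as described and XOR bank hashing applied. Only the 5-bit/2-bit column split comes from the slide; the 128B block size, the 3-bit bank field, and all other widths are assumptions, not the paper's exact layout.

    #include <cstdint>

    // Decoded DRAM coordinates (illustrative field widths).
    struct DramAddr {
        uint64_t row;
        unsigned colMsb, bank, colLsb, block;
    };

    DramAddr decode(uint64_t realAddr) {
        DramAddr a;
        a.block  = static_cast<unsigned>( realAddr        & 0x7F); // 7-bit block offset (assumed 128B line)
        a.colLsb = static_cast<unsigned>((realAddr >> 7)  & 0x3);  // 2 column LSBs: 4 sequential
                                                                   // lines share one row buffer
        a.bank   = static_cast<unsigned>((realAddr >> 9)  & 0x7);  // bank field (assumed 8 banks)
        a.colMsb = static_cast<unsigned>((realAddr >> 12) & 0x1F); // 5 column MSBs, just below the row
        a.row    = realAddr >> 17;                                 // remaining bits select the row
        // XOR bank hashing: fold higher-order (row) bits into the bank
        // selection to spread conflicting rows across banks [Zhang et al.].
        a.bank  ^= static_cast<unsigned>(a.row & 0x7);
        return a;
    }

Note how a sequential stream keeps 4 consecutive cache lines in one row buffer (via colLsb) and then moves to a different bank, which is exactly the low-RBL / high-BLP behavior the scheme targets.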

4.2 Data Prefetch Engine [IBM POWER6]
: predictable page-mode opportunities imply the need for an accurate prefetch engine
: each core includes a HW prefetcher with a prefetch depth/distance predictor

1) Multi-line Prefetch Requests
-. A multi-line prefetch operation is a single request (indicating a specific sequence of cache lines)
-. Reduces command bandwidth and queue resources
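A hypothetical encoding of such a request might look like the following; the struct, its fields, and the 128B line size are illustrative assumptions, not the paper's format.

    #include <cstdint>
    #include <vector>

    // One queue entry describes a run of sequential cache lines instead of
    // N separate entries, saving command bandwidth and request-queue slots.
    struct MultiLinePrefetch {
        uint64_t startAddr;  // first cache line of the run
        unsigned numLines;   // length of the sequential run
        unsigned distance;   // prefetch depth/distance hint from the
                             // predictor (later used for prioritization)
    };

    // The memory controller expands the single request into column accesses
    // against one open row, paying a single ACT for the whole run.
    std::vector<uint64_t> expand(const MultiLinePrefetch& req,
                                 unsigned lineBytes = 128) { // assumed line size
        std::vector<uint64_t> lines;
        for (unsigned i = 0; i < req.numLines; ++i)
            lines.push_back(req.startAddr + static_cast<uint64_t>(i) * lineBytes);
        return lines;
    }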

4.3 Memory Request Queue Scheduling Scheme

: In OOO execution, the importance of each request can vary both between and within applications, hence the need for a dynamic priority scheme

1) DRAM Memory Request Priority Calculation (see the sketch after this list)
-. Different priorities based on criticality to performance
-. The priority of each request is increased every 100ns time interval (time-based aging)
-. 2 categories: read (normal) and prefetch; read requests get higher priority
-. MLP information from the MSHR in each core: many outstanding misses means each individual miss is less important
-. Distance information from the prefetch engine (4.2)
(MLP: Memory-Level Parallelism, MSHR: Miss Status Holding Register)
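Below is a minimal sketch of how these inputs could combine into a single priority value. The specific weights, thresholds, and the 3-bit priority field are assumptions for illustration; the paper defines its own rule table, so treat this only as the flavor of the scheme.

    #include <algorithm>
    #include <cstdint>

    struct Request {
        bool     isPrefetch;     // prefetch reads start below demand reads
        unsigned mshrMisses;     // outstanding misses in the core's MSHR (MLP)
        unsigned prefetchDist;   // distance hint from the prefetch engine
        uint64_t enqueueTimeNs;  // when the request entered the queue
    };

    unsigned priority(const Request& r, uint64_t nowNs) {
        unsigned p = r.isPrefetch ? 0u : 4u;       // demand reads above prefetches
        if (r.mshrMisses <= 1) p += 2;             // low MLP: this miss is critical
        if (r.isPrefetch && r.prefetchDist <= 2)   // short distance: needed soon
            p += 1;
        // Time-based aging: bump priority every 100ns to prevent starvation.
        p += static_cast<unsigned>((nowNs - r.enqueueTimeNs) / 100);
        return std::min(p, 7u);                    // assumed 3-bit priority field
    }

The aging term is what keeps the scheme fair: even a deprioritized prefetch from a high-MLP core eventually climbs past newly arrived demand reads.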

4.3 Memory Request Queue Scheduling Scheme (cont.)

2) DRAM Page Closure (Precharge) Policy
-. Uses auto-precharge, increasing the available command bandwidth (no separate PRE command on the command bus)

3) Overall Memory Request Scheduling Scheme (Priority Rules 1)
-. The same rules are used by all MCs, so no communication among MCs is needed
-. If an MC is servicing the multiple transfers of a multi-line prefetch request, it can be interrupted by a higher-priority request, so a very critical request can be serviced with the smallest latency

4) Handling Write Operations
-. The dynamic priority scheme does not apply to writes
-. Uses the VWQ (Virtual Write Queue), causing minimal write-induced interference

The rules in Priority Rules 1 summarize the per-request scheduling prioritization scheme that is used in the Minimalist Open-page scheme.
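Given per-request priorities, the arbitration step itself can be sketched as below, assuming arrival order breaks ties within a priority level (an assumption consistent with, but not quoted from, the paper). Because selection reruns every scheduling cycle, an in-progress multi-line prefetch is naturally preempted by a higher-priority arrival.

    #include <cstdint>
    #include <vector>

    struct QueuedRequest {
        unsigned priority;       // from the per-request calculation above
        uint64_t enqueueTimeNs;  // FCFS tiebreak within a priority level
    };

    // Pick the pending request with the highest priority; among equals,
    // the oldest wins. Returns nullptr if the queue is empty.
    const QueuedRequest* selectNext(const std::vector<QueuedRequest>& q) {
        const QueuedRequest* best = nullptr;
        for (const QueuedRequest& r : q)
            if (!best || r.priority > best->priority ||
                (r.priority == best->priority &&
                 r.enqueueTimeNs < best->enqueueTimeNs))
                best = &r;
        return best;
    }

Because every MC applies the identical, priority-only rule, no cross-controller coordination is required, matching the slide's point above.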

5. Evaluation

-. 8-core CMP system using the Simics functional model extended with the GEMS toolset
-. Simulates DDR3 1333MHz DRAM, using the memory controller policy under test for each experiment
-. The Minimalist open-page scheme is compared against three open-page policies (Table 5):
   1) PAR-BS (Parallelism-Aware Batch Scheduler)
   2) ATLAS (Adaptive per-Thread Least-Attained-Service) memory scheduler
   3) FR-FCFS (First-Ready, First-Come-First-Served): the baseline

8-core CMP system using the Simics functional model [13] extended with the GEMS toolset [15]; the OOO processor model from GEMS is used along with an in-house detailed memory subsystem model and the GEMS hardware prefetching engine (Section 4.2). The memory controller model simulates DDR3 1333MHz DRAM using the memory controller policy under test for each experiment. Table 2 lists the full-system simulation parameters, Table 3 the DDR3 parameters modeled in our toolset, and Table 4 the selected workload set with their peak bandwidth use. Due to DDR DRAM timing constraints, achievable utilization is about 70%. 100M instructions are used to warm up the caches and MC, followed by 100M instructions of simulation.