
Energy Efficient DRAM Row Buffer Management for Enterprise Workloads

Karthik Kumar School of Electrical and Computer Engineering

Purdue University, West Lafayette, IN, USA [email protected]

Martin Dimitrov, Kshitij Doshi Software and Services Group

Intel Corporation, Chandler, AZ, USA {martin.p.dimitrov, kshitij.a.doshi}@intel.com

Abstract—This paper establishes that adapting the DRAM row buffer policy in accordance with both locality and activity in memory access patterns can lead to significant reductions in DRAM energy. In particular, using two different enterprise workloads, we establish that a spatially and temporally adaptive row buffer timeout policy is more energy-efficient than any static policy. Our results show that energy can be reduced by 6% and 30% for the two enterprise workloads.

I. INTRODUCTION

In recent years, climate concerns and the rising costs of powering and cooling have brought a sharp focus on the design of energy-efficient computing systems. Among system components, memory (DRAM) is one of the most power-hungry, accounting for 40% of total system power in a mid-range enterprise server [1]. Recent studies have shown that as the memory per core increases, memory power will come to dominate CPU power consumption [19].

Typical server workloads such as database activity mixes have large memory footprints and long-tailed memory access patterns that vary across time, space, and threads. For example, Figure 1 shows the activity and locality observed for two memory ranks under a TPC-C-like workload [20] and the DBT-5 workload [21]. (All further references to TPC-C in this paper refer to a close imitation of TPC-C without its warehouse scaling requirement.) In the figure, activity is expressed by the number of memory references. Locality is expressed in terms of the CAS-to-RAS ratio, a measure of how many successive references are to the same row (see Section II for background on DRAM architecture). Note the substantial variation in both the amount of activity and the degree of locality, across differing memory ranks as well as within the same rank as time proceeds. Unfortunately, current DRAM systems are rigid, and the opportunity for guiding and adapting energy expenditure based on such variations in memory access patterns (for example, by adapting physical address mapping, prefetching, and row buffer policies) remains largely unexplored.

This paper focuses on the row buffer policy: usually a static configuration choice that determines how long to keep a row open in memory after it is accessed. Existing row buffer policies focus on improving performance by keeping rows open based on the observed locality of memory references. However, the relationship between performance and energy is complex: while improving performance does save energy, since overall time is reduced, large amounts of power can be expended by keeping rows open.

Figure 1, panels (a)-(h): Activity (number of references) and locality (CAS-to-RAS ratio) for two different ranks, observed over 5 million memory references of the TPC-C and DBT-5 workloads after filtering through a 2MB cache. The x-axis shows time in units of 10^5 memory cycles. Both activity and locality vary over time, and also vary between the ranks.


This is especially the case when a rank of memory has low activity and enters low-power modes. As further explained in Section III.A, the example of Figure 1 (showing significant variations in activity and locality) suggests that static row-buffer management is not energy efficient, and that optimal row-buffer management must adapt both over time and according to the locality and utilization of each rank. In this paper, we propose a policy that adapts on a per-rank basis using both activity and locality. We implement our adaptive policy using DRAMSim, a cycle-accurate memory system simulator [6]. We evaluate our policy using memory access traces representative of TPC-C and DBT-5 workloads and show that our adaptive policy is more energy-efficient than any static policy. Our results show that energy can be reduced by 6% and 30% for the two workloads.

II. BACKGROUND AND RELATED WORK

A. Background on DRAM Architecture

Memory is spatially organized into channels, ranks, banks, rows, and columns. On a memory access, first the channel, rank, and bank are identified. Then an entire row (usually 8K bytes) within the bank is loaded onto a set of sense amplifiers called the row buffer, from which the corresponding column is then referenced; each bank has one row buffer (see Figure 2). If a row is kept open and a subsequent reference is to the same open row (a row-buffer hit), then performance is improved, since the row need not be reloaded into the row buffer. If the subsequent reference is to a different row, there is a conflict, and the buffer must be flushed before the required row is loaded. Thus, the row buffer management policy is usually a static configuration choice that determines how long to hold a row "open" in the row buffer after a memory reference. There are three choices for the row buffer policy:

(1) keep the row open in the row buffer, anticipating successive references to the same open row (open page policy);
(2) close the row immediately, anticipating successive references to different rows within the bank (close page policy); or
(3) close a row after a timeout interval of several nanoseconds (timeout policy).

The open page policy is better suited to workloads with high spatial locality in their reference patterns (such as streaming applications), the close page policy is suited to workloads with low spatial locality (a high degree of randomness) in their reference patterns, and the timeout policy offers a tradeoff between the two.
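To make the three policies concrete, the following minimal Python sketch shows how a single bank might classify each access under each policy. This is an illustrative model under our own assumptions (the class and method names are ours), not the controller logic of any real memory system:

```python
# Minimal sketch: outcome of an access to one bank under the three
# row-buffer policies. Timing is counted in abstract memory cycles.

class Bank:
    def __init__(self, policy, timeout=0):
        self.policy = policy        # "open", "close", or "timeout"
        self.timeout = timeout      # hold time in cycles (timeout policy only)
        self.open_row = None        # row currently held in the row buffer
        self.last_access = None     # cycle of the most recent access

    def access(self, row, cycle):
        """Return 'hit', 'closed', or 'conflict' for an access to `row`."""
        # Under the timeout policy, a row past its hold time was precharged.
        if (self.policy == "timeout" and self.open_row is not None
                and cycle - self.last_access > self.timeout):
            self.open_row = None
        if self.open_row == row:
            outcome = "hit"        # reuse the open row: CAS only, no reload
        elif self.open_row is None:
            outcome = "closed"     # row buffer empty: RAS, then CAS
        else:
            outcome = "conflict"   # wrong row open: precharge, RAS, then CAS
        self.open_row = None if self.policy == "close" else row
        self.last_access = cycle
        return outcome
```

For example, with `Bank("timeout", timeout=10)`, two accesses to the same row 5 cycles apart yield a hit, while the same pair spaced 50 cycles apart finds the row already closed.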

In addition to impacting performance by issuing timely row-open/precharge commands, the row-buffer management policy also impacts the power consumption of memory by enabling or preventing memory from transitioning into deep low-power modes. In particular, in current DDR2/DDR3 memory systems, a rank (the smallest unit of power management) achieves the best power savings when the row buffers in all of its banks have been closed. Table I shows the current parameters for DDR2 memory: when a rank is not powered down, the current drawn to hold a row open (IDD3N, 45 mA) is the same as the current drawn when the row is closed (IDD2N, 45 mA). However, when a rank is powered down, the current to hold a row open (IDD3P, 25 mA) is five times the current when the row is closed (IDD2P, 5 mA).

Figure 2. Memory organization: organization of banks within a rank. Each bank (shown in a different shade of blue) has its own set of sense amplifiers.

Thus, intelligently closing rows based on observed rank activity is critical to enabling deep power-down modes.

TABLE I. CURRENT PARAMETERS FOR MICRON DDR2 SDRAM MEMORY.

Symbol    Current type                               x8 (mA)
IDD0      Operating current                          80
IDD2P     Precharge power-down current               5
IDD2N     Precharge standby current                  45
IDD3P     Active power-down current (fast exit)      25
IDD3N     Active standby current                     45
IDD4R     Operating read current                     145
IDD4W     Operating write current                    130
IDD5A     Auto refresh current                       200
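As a back-of-the-envelope illustration (our arithmetic; the standard DDR2 supply voltage of 1.8 V is assumed, since the table lists only currents), the per-device background power in the powered-down state is

$$ P_{open} = I_{DD3P} \cdot V_{DD} = 25\,\mathrm{mA} \times 1.8\,\mathrm{V} = 45\,\mathrm{mW}, \qquad P_{closed} = I_{DD2P} \cdot V_{DD} = 5\,\mathrm{mA} \times 1.8\,\mathrm{V} = 9\,\mathrm{mW} $$

a fivefold difference that accumulates over every cycle a rank idles with an open row.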

B. Related Work

Previous work on row-buffer management has typically focused on reducing conflicts and improving performance, for instance by reordering memory references [2], permuting the physical address bits [3], and other techniques (see Table II for a summary). More closely related to our work are [4] and [5], in which the row-buffer hold times are changed dynamically in order to reduce conflicts and improve performance. A recent memory controller from Intel [18] alternates between open and close page policies based on the observed locality in the memory stream. The common focus of these techniques is to improve performance by observing the locality of memory references. As mentioned earlier, while improving performance does save energy, since overall time is reduced, large amounts of power can be expended by keeping rows open. This is especially the case when there are few references to a rank of memory and the rank transitions to the power-down state with an open row (thus missing the opportunity to transition into precharge power-down, IDD2P). In this study, we focus on energy efficiency, and we consider both activity and locality in a rank when adapting row buffer timeout intervals dynamically.

TABLE II. RELATED WORK: PATENTS AND PUBLICATIONS ON DRAM ROW BUFFER MANAGEMENT.

Reference        Contribution
[3]              Reduce conflicts by breaking address symmetry
[10], [11]       Precharge subsets of memory cells to save energy
[5], [12], [13]  Predict/reduce conflicts
[14]             Provide page-open hint in transactions
[15]             Improve performance by adaptive close
[16]             Page open/close based on high-order address bit
[4]              Precharge policy based on address analysis
[17]             Aggregate cache lines onto the same row buffer


III. ADAPTIVE ROW BUFFER MANAGEMENT

A. Adaptive Policy

Figure 1 exhibits significant spatial (rank-to-rank) variations in the activity and locality of memory access patterns. For instance, we can see from the figure that rank 2 (Figure 1 (g)) has far fewer memory references than rank 1 (Figure 1 (e)) during the execution of the same DBT-5 workload. This is an important observation, because the low activity of rank 2 suggests that this rank is much more likely to transition to low-power states. From Figure 1 we can also observe strong temporal variation in activity and locality within the same rank. For instance, as time progresses, rank 2 shows a large number of memory references (Figure 1 (c)) and a high degree of locality (Figure 1 (d)) in one section of the TPC-C workload. During a subsequent section of the same workload, rank 2 sees very few references, with reduced locality. This suggests that rank 2 should keep rows open longer (to capture locality) during the first part, and close rows more aggressively (to save energy) in the second part. The reverse trend is observed for rank 1 (Figure 1 (a)), where the number of references increases in the second part. Clearly, the spatial and temporal non-uniformity of activity and locality indicates that a static row-buffer policy is sub-optimal.

In order to construct a timeout policy that is spatially and temporally adaptive, we first divide the workload into time windows, or observation windows. For each observation window, we observe two factors: the activity A, given by the number of references, and the locality L, given by the CAS-to-RAS ratio. Based on these factors we construct four cases for timeouts:

(1) High activity, high locality: timeout τ1
(2) High activity, low locality: timeout τ2
(3) Low activity, high locality: timeout τ3
(4) Low activity, low locality: timeout τ4

B. Selection of Timeout Parameters

We select a large value of timeout for τ1 (≈ the length of the time window) because high activity results in many accesses to each row, and power is not lost by holding the row open longer (the rank rarely powers down, owing to the high activity). In addition, due to the high locality, a large τ1 helps reduce the energy spent on redundant precharges. We select the smallest value of timeout for τ4 (≈ 0) because both activity and locality are low, and we want to close the row as soon as possible to save energy. The selection of τ2 and τ3 involves more subtle tradeoffs. For τ2, while a low value may not save power (high A implies a lower power-down probability), it preserves performance by accelerating precharges and by avoiding bank conflicts under low locality. For τ3, although locality is high, the low activity requires a tradeoff between a timeout large enough to capture bursts of highly local references and small enough to avoid wasting power (by keeping rows open) when the localized references are spaced out in time. We solve this by mapping τ2 and τ3 dynamically into a range, depending on the magnitudes of A and L relative to two fixed calibration points AΔ and LΔ, where AΔ is the threshold number of references in a given time window above which we modify the timeout, and LΔ is the threshold locality above which we modify the timeout. Note that the applied timeout itself is a dynamic choice among τ1, τ2, τ3, and τ4; a sketch of this selection follows.
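A minimal Python sketch of the per-window selection is shown below. The four-case thresholding follows the description above; the paper does not give an exact rule for mapping τ2 and τ3 into a range, so the plain threshold comparison (and all names) here are illustrative assumptions:

```python
# Illustrative sketch: choose a row-buffer timeout for a rank's next
# observation window from its measured activity and locality.
#   A        - number of references in the window (activity)
#   L        - CAS-to-RAS ratio in the window (locality)
#   a_thresh - activity calibration point (A-delta in the paper)
#   l_thresh - locality calibration point (L-delta in the paper)

def select_timeout(A, L, a_thresh, l_thresh, tau1, tau2, tau3, tau4):
    """Return the timeout (in memory cycles) to apply in the next window."""
    high_activity = A >= a_thresh
    high_locality = L >= l_thresh
    if high_activity and high_locality:
        return tau1  # case 1: hold rows open; the busy rank rarely powers down
    if high_activity:
        return tau2  # case 2: precharge early to avoid bank conflicts
    if high_locality:
        return tau3  # case 3: brief hold to capture bursts of local references
    return tau4      # case 4: close immediately so the rank can power down
```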

IV. EVALUATION

A. Experimental Setup

We use DRAMSim [6] with a 4GB, 667MHz memory system and a 2MB last-level cache for our validation. We simulate a 4GB DDR2 SDRAM memory system with 2 channels, 4 ranks per channel, and 8 banks per rank. We examined the performance and energy of various address maps, which determine how memory addresses are mapped to the physical devices. We found that the default address map in DRAMSim works best under most conditions, and hence we select it for our validation.

The values of the timeouts τ1, τ2, τ3, τ4 and the thresholds AΔ and LΔ can vary significantly for each memory configuration and workload. The relevant configuration parameters include the address map, the size of the observation window, the access pattern of the workload, and so on. In this study, we use an observation window of 10,000 memory cycles. We observe A over a few windows and then select AΔ to be the median value among the observations. For LΔ, we use a constant value of 2, which we observed to work well empirically. We assign the smallest timeout τ4 = 0, and for the largest timeout the size of the observation window, τ1 = 10,000. For the two intermediate timeouts, we assign values of τ2 = 100 and τ3 = 10, respectively. We use 5 million (post-cache) memory references from traces representative of TPC-C and DBT-5 workloads. The first trace is a public-domain TPC-C trace obtained from [7]; the second trace is from DBT-5, an open-source implementation of the TPC-E workload. The DBT-5 trace is a multi-threaded (4 threads) trace of a run on MySQL, obtained using Pin [8], a dynamic binary instrumentation tool.
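Continuing the sketch above with the parameter values used in this study; the median-based calibration mirrors the description in the text, while the helper name and the sample activity counts are hypothetical:

```python
import statistics

def calibrate_activity_threshold(activity_samples):
    """A-delta: median of the activity observed over the first few windows."""
    return statistics.median(activity_samples)

window = 10_000  # observation window, in memory cycles
a_thresh = calibrate_activity_threshold([1200, 450, 3100, 800, 950])  # hypothetical counts
l_thresh = 2     # L-delta: constant found to work well empirically

# tau1 = window size, tau2 = 100, tau3 = 10, tau4 = 0, as in the text.
tau = select_timeout(A=2500, L=3.4, a_thresh=a_thresh, l_thresh=l_thresh,
                     tau1=window, tau2=100, tau3=10, tau4=0)
print(tau)  # 10000: this window is high-activity (2500 >= 950) and high-locality
```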

B. Experimental Results

We compare the power, performance, and energy in Figure 3. Power is the average power consumed per memory cycle, expressed in Watts. Performance is reflected in the average time taken by the memory system to service a memory reference, expressed in cycles. Energy is given by the total energy expended by the memory system in Joules. Savings are shown normalized with respect to a static timeout of 28K memory cycles, similar to [9].
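For reference, the quantities in Figure 3 relate as follows (our notation): with average power P̄, total memory cycles N, and cycle time t_cyc,

$$ E = \bar{P} \cdot N \cdot t_{cyc}, \qquad \text{savings} = 1 - \frac{E_{policy}}{E_{static,28K}} $$

so a policy can save energy either by drawing less average power or by finishing the same references in fewer cycles.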

TPC-C Results: The first bar in Figure 3 shows the TPC-C workload. We observe that as the timeout decreases, from the open page policy through the various static timeouts to the close page policy, the power savings improve due to more time spent in low-power modes. The adaptive policy saves the most power (≈ 8%) by avoiding redundant precharges during periods of high locality and entering low-power modes during periods of low activity. The reverse trend is observed for performance: as the timeout decreases, performance worsens, because premature precharges result in redundant row reloads during periods of high locality. The open page and static timeout policies have the best performance; however, they are not the most energy efficient. This underscores the difference between performance and energy: it is important to consider rank activity and conserve power by closing rows during rank inactivity. By doing so, the adaptive scheme saves about 6% energy compared to the baseline.

DBT-5 Results: The second bar in Figure 3 shows the DBT-5 workload. Interestingly, for this workload the adaptive scheme saves energy by improving performance. This is because the workload has significant variation in the locality of its memory references. When there is high locality, both the static timeouts and the adaptive scheme improve performance; however, when there is low locality, the adaptive scheme reduces the timeout and issues precharges early, avoiding bank conflicts. This approach saves about 30% energy when compared to the baseline.

Discussion: In some workloads, such as DBT-5, we observed a material difference in energy between the close page policy and an open page policy with a timeout of 0. This is because a memory access comprises three commands: row access (RAS), column access (CAS), and precharge. All three commands are combined to form a single transaction in the transaction queue, and the transaction queue is common to all ranks within a channel.

In the close page policy, the RAS, CAS, and precharge are part of a single transaction, and thus the precharge takes place immediately after the CAS. For an open-page-with-timeout policy, the RAS and CAS form a single transaction and the precharge is issued separately as a new transaction: a data structure called the page table [18] maintains a timer to check whether the timeout has lapsed, and if so, a precharge is issued. In the case of a zero timeout, the timeout has lapsed by the very next cycle; however, the transaction containing the precharge command may not be inserted into the transaction queue until several cycles later, because new transactions for other ranks are inserted ahead of it. As a result, the row remains open until the precharge command is issued, and energy is expended. Thus, in our experiments with the DBT-5 workload, for ranks with consistently very low activity we switched to the close page policy instead of an open page policy with timeout 0.
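To illustrate the queueing effect with a toy model (entirely our own simplification: one shared queue, one transaction serviced per cycle), the extra cycles a row stays open under a zero timeout correspond to the transactions from other ranks that enter the queue ahead of the separately issued precharge:

```python
from collections import deque

def extra_open_cycles(txns_ahead):
    """Toy model: cycles a row stays open while its precharge waits in the
    shared transaction queue (one transaction serviced per cycle)."""
    queue = deque(txns_ahead)      # transactions for other ranks, already queued
    queue.append("precharge")      # timeout-0: precharge arrives as a new transaction
    cycles = 0
    while queue.popleft() != "precharge":
        cycles += 1                # row remains open, drawing active current
    return cycles

# Close page: the precharge is bundled with RAS+CAS, so there is no separate wait.
print(extra_open_cycles([]))                            # -> 0
# Timeout 0: three transactions for other ranks slipped in first.
print(extra_open_cycles(["rank0", "rank2", "rank3"]))   # -> 3
```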

V. CONCLUSION AND FUTURE WORK

In summary, in this work we show that adapting the row buffer policy in accordance with patterns of activity and locality can save 6% and 30% energy for memory references representative of two enterprise workloads. As part of our future work, we plan to expand the scope of our experiments to include additional workloads, evaluate the impact of future memory technologies, and confirm our results empirically.

ACKNOWLEDGMENT

We would like to thank David Baker and David Ott from Intel for reviewing drafts of our work and providing valuable feedback.

REFERENCES

[1] C. Lefurgy et al., "Energy Management for Commercial Servers," IEEE Computer, 2003, pp. 39-48.
[2] I. Hur et al., "Adaptive History-Based Memory Schedulers," MICRO, 2004, pp. 343-354.
[3] Z. Zhang et al., "A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality," MICRO, 2000, pp. 32-41.
[4] C. Ma et al., "A DRAM Precharge Policy Based on Address Analysis," Euromicro Conference, 2007, pp. 244-248.
[5] R. Isaac et al., "Dynamic page conflict prediction for DRAM," 2006, Patent 7,133,995.
[6] D. Wang et al., "DRAMsim: A Memory System Simulator," ACM SIGARCH Comp. Arch. News, vol. 33, no. 4, 2005, pp. 107-112.
[7] BYU Trace Distribution Center, http://tds.cs.byu.edu/tds/
[8] C.-K. Luk et al., "Pin: Building customized program analysis tools with dynamic instrumentation," PLDI, 2005, pp. 190-200.
[9] Intel Corporation, Intel 82875P Memory Controller Hub, 2004.
[10] M. Rao GR, "Memories with selective precharge," 2008, WO Patent 006,081.
[11] H. David, "Partial bank DRAM precharge," 2008, Patent 7,392,339.
[12] P. Madrid, "Detection of Speculative Precharge," 2009, WO Patent WO/2009/025,712.
[13] Y. Xu et al., "Prediction in Dynamic SDRAM Controller Policies," Embedded Computer Systems: Architectures, Modeling, and Simulation, 2009, pp. 128-138.
[14] J. Cho et al., "Page open hint in transactions," 2002, Patent application 10/323,381.
[15] J. Dodd et al., "Adaptive page management," 2003, Patent application 10/676,781.
[16] R. Farrell et al., "Page open/close scheme based on high order address bit and likelihood of page access," 1997, Patent 5,664,153.
[17] K. Sudan et al., "Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement," ASPLOS, 2010, pp. 219-230.
[18] Intel Corporation, Xeon Processor 5500 Series Datasheet, 2009.
[19] M. Tolentino et al., "Memory MISER: Improving Main Memory Energy Efficiency in Servers," IEEE Transactions on Computers, 2009, pp. 336-350.
[20] Transaction Processing Performance Council, http://www.tpc.org/
[21] Open Source Development Labs, http://osdldbt.sourceforge.net/

Figure 3. Power, Performance, and Energy Savings for different row buffer management policies using the TPC-C and DBT-5 workloads. Results are shown normalized to a static policy with a timeout of 28,000 memory cycles.
