



arXiv:1409.5567v1 [cs.PF] 19 Sep 2014

Rank-Aware Dynamic Migrations and Adaptive Demotions for DRAM Power Management

Yanchao Lu, Donghong Wu, Bingsheng He, Xueyan Tang, Jianliang Xu and Minyi Guo

Abstract—Modern DRAM architectures allow a number of low-power states on individual memory ranks for advanced power management. Many previous studies have taken advantage of demotions to low-power states for energy saving. However, most of the demotion schemes are statically performed on a limited number of pre-selected low-power states, and are suboptimal for different workloads and memory architectures. Even worse, the idle periods are often too short for effective power state transitions, especially for memory intensive applications. Wrong decisions on power state transitions incur significant energy and delay penalties. In this paper, we propose a novel memory system design named RAMZzz with rank-aware energy saving optimizations including dynamic page migrations and adaptive demotions. Specifically, we group the pages with similar access locality into the same rank with dynamic page migrations. Ranks have their hotness: hot ranks are kept busy for high utilization, and cold ranks can have more lengthy idle periods for power state transitions. We further develop adaptive state demotions by considering all low-power states for each rank and a prediction model to estimate the power-down timeout among states. We experimentally compare our algorithm with other energy saving policies with cycle-accurate simulation. Experiments with benchmark workloads show that RAMZzz achieves significant improvement on energy-delay^2 and energy consumption over other energy saving techniques.

Index Terms—Demotion, Energy consumption, Main memory systems, In-memory processing, Page migrations

1 INTRODUCTION

ENERGY consumption has become a major factor for the design and implementation of computer systems. Inside many computing systems, main memory (or DRAM) is a critical component for performance and energy consumption. As processors have moved to the multi-/many-core era, more applications run simultaneously with their working sets in the main memory. The hunger for main memory of larger capacity makes the amount of energy consumed by main memory approach or even surpass that consumed by processors in many servers [1], [2]. For example, it has been reported that main memory contributes as much as 40–46% of total energy consumption in server applications [2], [3], [4]. For these reasons, this paper studies energy saving techniques for main memory.

Current main memory architectures allow power management on individual memory ranks. Individual ranks at different power states consume different amounts of energy. There have been various energy-saving techniques exploiting the power management capability of main memory [5], [6], [7], [8]. The common theme of those research studies is to exploit the transition of individual memory ranks to low-power states (i.e., demotion) for energy saving. Fan et al. concluded that immediate transitions to the low-power state save the most energy for most single-application workloads [9].

• Bingsheng He and Xueyan Tang are with Nanyang Technological University, Singapore. Corresponding author: Bingsheng He, [email protected].

• Yanchao Lu, Donghong Wu and Minyi Guo are with Shanghai Jiao Tong University. The work of Yanchao Lu and Donghong Wu was done when they were visiting students at Nanyang Technological University, Singapore.

• Jianliang Xu is with Hong Kong Baptist University.

However, the decision can be wrong for more memory intensive workloads such as multi-programmed executions. Huang et al. [6] have shown that only sufficiently long idle periods can be exploited for energy saving, because state transitions themselves take a non-negligible amount of time and energy. Essentially, the amount of energy saving relies on the distributions of idle periods and on how effectively power management techniques exploit the idle periods. Existing techniques are suboptimal in the following aspects: (1) they do not effectively extend the idle period, either with static page placement [9], [10] or with heuristics-based page migrations [5], [6]; (2) the prediction of the power-down timeout (the amount of time spent since the beginning of an idle period before transferring to a low-power state) for a state transition is limited and static, either with heuristics [5], [6] or with a regression-based model [9]; (3) most of the demotion schemes are statically performed on a limited number of pre-selected low-power states (e.g., Huang et al. [6] select only two low-power states, out of five in DDR3). The static demotion scheme is suboptimal for different workloads and different memory architectures.

To address the aforementioned issues, we propose a novel memory design named RAMZzz with rank-aware power management techniques including dynamic page migrations and adaptive demotions. Instead of having static page placement, we develop dynamic page migration mechanisms to exploit the access locality changes in the workload. Pages are placed into different ranks according to their access locality so that the pages in the same rank have roughly the same hotness. As a result, ranks are categorized into hot and cold ones.


The hot rank is highly utilized and has very short idle periods. In contrast, the cold rank has a relatively small number of long idle periods, which is good for power state transitions for energy saving.

Instead of adopting static demotion schemes, we develop adaptive demotions to exploit the power management capabilities of all low-power states for individual ranks. The decisions are guided by a prediction model to estimate the idle period distribution. The prediction model combines the historical page access frequency and the historical idle period distribution, and is specifically designed with the consideration of page migrations among ranks. Based on the prediction model, RAMZzz is able to optimize for different goals such as energy saving and energy-delay^2 (ED2). In this paper, we focus on the optimization goal of minimizing ED2 (or energy consumption) of the memory system while keeping the program performance penalty within a given budget. The budget is a pre-defined performance slowdown relative to the maximum performance without any power management (e.g., 10% performance loss).

We evaluate our design using detailed simulations of different workloads including SPEC 2006 and PARSEC [11]. We evaluate RAMZzz in comparison with representative power saving policies [6], [10], [12] and an ideal oracle approach. Our experiments with the optimization goal of ED2 (for a maximum acceptable performance degradation of 4%) on three different DRAM architectures show that (1) both page migrations and adaptive demotions adapt well to the workload. Page migrations achieve an average ED2 improvement of 17.1–21.8% over schemes without page migrations, and adaptive demotions achieve an average ED2 improvement of 22.4–36.4% over static demotions; (2) with both page migrations and adaptive demotions, RAMZzz achieves an average ED2 improvement of 63.0–64.2% over the basic approach without power management, and has only 3.7–5.7% larger ED2 on average than the ideal oracle approach. The experiments with the optimization goal of energy consumption have demonstrated similar results.

Organization. The rest of the paper is organized as follows. We introduce the background on basic power management of DRAM and review related work in Section 2. Section 3 gives an overview of the RAMZzz design, followed by detailed implementations in Section 4. The experimental results are presented in Section 5. We conclude this paper in Section 6.

2 BACKGROUND AND RELATED WORK

2.1 DRAM Power Management

In this paper, we use the terminology of DDR-series memory architectures (e.g., DDR2 and DDR3) to describe our approach. We will evaluate RAMZzz on different DDR-series memory architectures in the experiments. DDR is usually packaged as modules, or DIMMs. Each DIMM contains multiple ranks. In power management, a rank is the smallest physical unit that we can control.

TABLE 1
Power states for three typical DRAM architectures.

Power State       Normalized Power   Resynchronization Time (ns)

DDR3 DRx4 at 1333 MHz [15]
ACT               1.0                0
ACT_PDN           0.612              6
PRE_PDN_FAST      0.520              18
PRE_PDN_SLOW      0.299              24
SR_FAST           0.170              768
SR_SLOW           0.104              6768

DDR2 DRx8 at 800 MHz [16]
ACT               1.0                0
ACT_PDN_FAST      0.619              5
ACT_PDN_SLOW      0.325              18
PRE_PDN           0.237              25
SR                0.178              500

LPDDR2 DRx16 at 800 MHz [17]
ACT               1.0                0
ACT_PDN           0.523              8
PRE_PDN           0.303              26
SR                0.194              100

Individual ranks can service memory requests independently and also operate at different power states. The power consumption of a memory rank can be divided into two main categories: active power and background power. Active power consists of the power that is required to activate the banks and service memory reads and writes. Background power is the power consumption without any DRAM accesses. Background power is a major component of the total DRAM power consumption, and tends to become more significant in the future [6], [13]. For example, Huang et al. [6] found that the background power contributes 52% of the total DRAM power in their evaluation. Memory capacity and bandwidth will become larger and are usually provisioned for peak usage, which causes severe under-utilization [14]. Therefore, we focus on reducing the background power consumption.

Different power states have different power consumptions. Entering a low-power state when a rank is idle reduces the background power consumption. To exit from a low-power state, the disabled hardware components need to be reactivated and the rank needs to be restored to the active state. State transitions among different power states cause latency and energy penalties.

Depending on which hardware components are disabled, modern memory architectures support a number of power states with complicated transitions [15], [16]. Each state is characterized by its power consumption and the time that it takes to transition back to the active state (resynchronization time). Typically, the lower the power consumption of a low-power state, the higher its resynchronization time. Table 1 summarizes the major power states of three typical DRAM architectures: DDR3, DDR2 and LPDDR2. We do not consider some advanced power management modes in LPDDR2, like Deep Power-down (DPD) and Partial Array Self Refresh (PASR), because data retention cannot be kept when LPDDR2 enters those states. For each state, we show its power consumption (normalized to that of ACT) and the resynchronization time back to ACT. The power consumption values are calculated with the DRAM System Power Calculator [18]. The resynchronization times are obtained from DRAM manufacturers' data sheets [15], [16], [17].
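As a concrete illustration, the DDR3 entries of Table 1 can be encoded as a small lookup table that a power-management policy could consult. The numerical values below are copied from Table 1; the data structure and field names are an illustrative sketch and are not part of the RAMZzz implementation.

```python
from collections import namedtuple

# One power state: normalized background power (relative to ACT) and
# resynchronization time in nanoseconds (values from Table 1, DDR3 DRx4 at 1333 MHz).
PowerState = namedtuple("PowerState", ["name", "norm_power", "resync_ns"])

DDR3_STATES = [
    PowerState("ACT",          1.000, 0),
    PowerState("ACT_PDN",      0.612, 6),
    PowerState("PRE_PDN_FAST", 0.520, 18),
    PowerState("PRE_PDN_SLOW", 0.299, 24),
    PowerState("SR_FAST",      0.170, 768),
    PowerState("SR_SLOW",      0.104, 6768),
]

# Low-power states in descending order of power consumption, i.e., the
# chain S1, ..., SM used by the adaptive demotion scheme in Section 4.3.
DDR3_LOW_POWER_CHAIN = DDR3_STATES[1:]
```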

From Table 1, we have the following observations on state demotions on different memory architectures.

First, on a specific memory architecture, power states have quite different latency and energy penalties as well as different power consumptions. Take DDR3 as an example. The pre-charge power-down with fast exit state (PRE_PDN_FAST) consumes 52% of the power of the active idle state (ACT), with relatively small latency and energy penalties. In contrast, the self-refresh with fast exit state (SR_FAST) consumes only 17% of the power of ACT, with much higher latency and energy penalties. The resynchronization time of SR_FAST is over an order of magnitude higher than that of PRE_PDN_FAST.

Second, different memory architectures have their own specifications on power states as well as on power state energy consumption and resynchronization time. First, different memory architectures may have different sets of power states. For example, DDR3 has a special low-power state, i.e., the self-refresh with slow exit state (SR_SLOW), whereas DDR2 and LPDDR2 do not have any equivalent state. SR_SLOW has a very high resynchronization time and consumes only 10% of the power of ACT. Second, the energy consumption or the resynchronization time of the same power state can vary across different memory architectures. Take the self-refresh states (SR) as an example. While SR consumes a similar normalized power on the three architectures (about 17–19%), the resynchronization time varies significantly. The resynchronization times on DDR3, DDR2 and LPDDR2 are 768ns (SR_FAST), 500ns (SR) and 100ns (SR), respectively.

The above observations have significant implications for DRAM power management design.

First, the above observations clearly show the deficiency of the static demotion schemes [5], [6], [7], [9]. The static demotion schemes are performed on pre-selected low-power states (even for all ranks in the same architecture, and for different memory architectures). On a specific memory architecture, the static decision loses the opportunities for demoting to the most energy-effective low-power state for different idle period lengths. Moreover, since the latency and energy consumption penalties and the power consumption of a low-power state vary with different memory architectures, the static decision loses the opportunities for adapting to different memory architectures.

Second, because the latency and energy penalty for switching from deeper low-power states is substantially higher than the penalty of switching from shallower states, entering deep power-down states for short idle times could in fact hurt energy efficiency, because the power savings might not be able to offset the high penalty of switching back to the active state. Thus, the effective use of deeper low-power states is contingent on having long idle periods on a rank. That naturally leads to two problems for reducing background power consumption: 1) how to create longer idle periods without modifying the application, and 2) how to make correct decisions on state transitions.

The design of RAMZzz is guided by the aforementioned two implications. It embraces dynamic migrations and adaptive demotions, adapting to different workloads and different memory architectures.

2.2 Related Work

We briefly review the related work on energy saving with power states and with other hardware and software approaches.

Saving energy by transiting memory power states has attracted many research efforts, covering memory controller design, compilers and operating systems.

Different power state transition approaches have been developed for DRAM systems. Hur et al. [19] developed adaptive history-based scheduling in the memory controller. Based on page migration, Huang et al. [6] stored frequently-accessed pages into hot ranks and left infrequently-used and unmapped pages on cold ranks. Their decisions on page migrations are based on heuristics. Lebeck et al. [12] studied different page allocation strategies. Their approach does not have any analytical model to guide the decision, nor does it utilize both recency and frequency to capture rank hotness. Diniz et al. [10] limited the energy consumption by adjusting the power states of DRAM. Our prediction model offers a novel way of power management by guiding page migrations and power state transitions. Fan et al. [9] developed an analytic model for estimating the idle time of DRAM chips using an exponential distribution. Their model does not consider page migrations. Kshitij et al. [20] used a similar page migration mechanism between cold and hot ranks, but always set cold ranks to a pre-selected low-power state. Instead of relying on the presumed knowledge of distribution, our prediction model combines the historical information on idle period distribution and page access locality. More importantly, compared with all previous studies that pre-define a number of fixed states for all ranks [6], [9], [10], [12], [19], [20], this paper develops adaptive demotions to exploit the energy-saving capabilities of all power states, and the adaptation is on the granularity of individual ranks for different memory architectures.

DRAM power state transitions have been implemented in operating systems and compilers. Delaluz et al. [7] presented an operating system based solution letting the scheduler decide the power state transitions. This approach requires interfaces for exposing and controlling the power states. Huang et al. [5] proposed power-aware virtual memory systems. For energy efficient compilations, Delaluz et al. [21] proposed compiler optimizations for the memory energy consumption of array allocations. They further combined hardware-directed and compiler-directed approaches [22] for more energy saving.

There are other approaches for reducing the DRAM power consumption. We review three representative categories. The first category is to reduce the active power consumption. Zheng et al. [23] suggested the subdivision of a conventional DRAM rank into mini-ranks comprising a subset of DRAM devices to improve DRAM energy efficiency. Anh et al. [24] proposed Virtual Memory Devices (VMDs) comprising a small number of DRAM chips. Decoupled DIMMs [25] run the DRAM devices at a lower frequency than the memory channel to reduce DRAM power. The second category is to reduce the power consumption of power state transitions. Bi et al. [26] took advantage of the I/O handling routines in the OS kernel to hide the delay incurred by memory power state transitions. Balis et al. [27] proposed finer-grained memory state transitions. The third category is to adjust the voltage and frequency of DRAM. Memory voltage and frequency scaling (DVFS) is a recent approach to reduce DRAM energy consumption [28], [29]. Those approaches are complementary to the state transition-based energy saving approaches.

Recently, different architectural designs of DRAM systems [13], [24], [30], [31] have been explored on multi-core processors for performance, energy, reliability and other issues. Cache-centric optimizations (either cache-conscious [32] or cache-oblivious [33], [34]) reduce memory accesses and create more opportunities for energy saving. Besides optimizations targeting general DRAM systems, some researchers have also proposed energy saving techniques for specific applications such as databases [8], [35] and video processing [35].

A preliminary version of RAMZzz has been presented in a previous paper [36]. This paper improves on the previous paper in many aspects, with two major improvements. First, we have enhanced RAMZzz with adaptive demotions, which further increases the effectiveness of state demotions on individual ranks. Second, we have evaluated the effectiveness of RAMZzz on different memory architectures, and demonstrated the self-tuning feature of RAMZzz for different workloads and different memory architectures.

3 DESIGN OVERVIEW

In this section, we give an overview of the design rationales and the workflow of RAMZzz.

3.1 Motivations

Our goal is to reduce the background power of DRAM. Due to the inherent power management mechanisms of DRAM, there are three obstacles to effectively reducing the background power.

First, due to the latency and power penalty of transiting from a low-power state to the active state, an idle period must exceed a minimum length threshold for a state transition to be worthwhile. Furthermore, the threshold value varies with the amount of energy and delay penalties of different state transitions. Since there is a length threshold for an idle period, an energy saving technique needs to determine whether an idle period on a rank is longer than the threshold or not. Ideally, if the idle period is longer than the threshold value, the rank should jump to the low-power state at the beginning of the idle period; otherwise, we should keep the rank in the active state. However, it is not easy to predict the length of each idle period, due to dynamic memory references.

Fig. 1. Overview of RAMZzz.

Second, the state transition-based power saving approaches cannot take full advantage of idle periods, especially for memory intensive workloads. In memory intensive workloads, the number of idle periods is large, and many of the idle periods are too short to be exploited for power saving. It is desirable to reshape the page references to different ranks so that the idle periods become longer and the number of idle periods is minimized.

Third, static demotion schemes cannot adapt to different workloads and different memory architectures. With page migrations, we further need adaptation for power management on individual ranks (differentiating the rank hotness).

3.2 Workflow of RAMZzz

We propose a novel memory design, RAMZzz, with dynamic migrations and adaptive demotions to address the aforementioned obstacles. We develop a dynamic page placement policy that is likely to create longer idle periods. The policy takes advantage of the recency and frequency of pages stored in the ranks, and ranks are categorized into hot and cold ones. The hot ranks tend to have very short idle periods, and the cold ranks tend to have relatively long idle periods. Page migrations are periodically performed to maintain the rank hotness (the period is defined as an epoch). With dynamic page migrations, short idle periods are consolidated into longer ones and the number of idle periods is reduced on the cold ranks. On the other hand, the configuration for adaptive demotions is determined periodically (the period is called a slot). For each slot, a demotion configuration (i.e., the power-down timeouts for all power states) is used to guide the demotions within the slot.

We further develop an analytical model to periodically estimate the idle period distribution of one slot. Our analytical model is based on the locality of memory pages and the idle period distribution of the previous slot. Given an optimization goal (such as minimizing energy consumption or minimizing ED2), we use the prediction model to determine the demotion configuration for the new slot. Since the prediction has a much lower overhead than the page migration, a slot is designed to be smaller than an epoch. In our design, an epoch consists of multiple slots. Figure 1 illustrates the relationship between slot and epoch. RAMZzz performs demotion configuration and prediction at the beginning of each slot and performs page migration at the beginning of each epoch.

The overall workflow of RAMZzz is designed as shown in Algorithm 1. RAMZzz maintains the prediction model by updating its data structures (Section 4.2). As the idle period length increases, actions of the adaptive demotion scheme may be triggered. At the beginning of each epoch, RAMZzz decides the page migration schedule and starts to migrate the pages to the destination ranks (Section 4.1). At the beginning of each slot, RAMZzz performs prediction and determines the demotion configuration for the new slot (Section 4.3). The next section describes the design and implementation details of each component.

Algorithm 1 Workflow of RAMZzz

1: if any memory reference to rank r then
2:   if rank r is in a low-power state then
3:     Set r to be ACT;
4:   Maintain the prediction model; /*Section 4.2*/
5: else
6:   Update the current idle period of rank r;
7:   Perform demotions (if necessary) according to the demotion configuration of rank r; /*Section 4.3*/
8: if the current cycle is the beginning of an epoch then
9:   Run the page migration algorithm and schedule page migrations; /*Section 4.1*/
10: if the current cycle is the beginning of a slot then
11:   Determine the demotion configuration for the new slot; /*Section 4.3*/

4 DESIGN AND IMPLEMENTATION DETAILS

After giving an overview of RAMZzz, we describe the details of the following components of rank-aware power management: dynamic page migration, the prediction model and adaptive demotions. Finally, we discuss some other implementation issues in integrating RAMZzz into memory systems.

4.1 Dynamic Page Migration

When an epoch starts, we first group the pages according to their locality, and each group maps to a rank in the DRAM. Next, pages are migrated according to the mapping from groups to ranks.

Rank-aware page grouping. We place the pages with similar hotness into the same rank. We adopt the main memory management policy named MQ [37]. We briefly describe the idea of MQ, and refer the readers to the original paper for more details. MQ has M LRU queues numbered from 0 to M−1. We assume M = 16 following previous studies [37], [38]. Each queue stores the page descriptor including the page ID, a frequency counter and a logical expiration time. The queue with a larger ID stores the page descriptors of the most frequently used pages. On the first access, the page descriptor is placed at the head of queue zero, with its expiration time initialized. A page descriptor in Queue i is promoted to Queue i+1 when its frequency counter reaches 2^(i+1). On the other hand, if a page in Queue i has not been accessed recently based on the expiration time, its page descriptor will be demoted to Queue i−1. We use a modified MQ structure to group physical memory pages [38]. The updates to the MQ structure are performed by the memory controller, which is designed to be off the critical path of memory accesses. More implementation details are described in Appendix A of the supplementary file.
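The following is a minimal sketch of the MQ bookkeeping just described (M = 16 queues, promotion once the frequency counter reaches 2^(i+1), demotion upon expiration). The class layout, the LIFETIME constant and the use of Python dictionaries are assumptions made for illustration; in RAMZzz this bookkeeping is performed by the memory controller.

```python
from collections import OrderedDict

M = 16          # number of MQ queues, following [37], [38]
LIFETIME = 128  # illustrative expiration window (in logical time units)

class MQ:
    """Sketch of the MQ structure: M queues of page descriptors."""

    def __init__(self):
        # queues[i]: page_id -> [frequency counter, logical expiration time];
        # within a queue, the most recently accessed descriptors sit at the end.
        self.queues = [OrderedDict() for _ in range(M)]
        self.where = {}   # page_id -> queue index
        self.clock = 0    # logical time, advanced on every access

    def access(self, page_id):
        self.clock += 1
        i = self.where.get(page_id)
        if i is None:
            i, freq = 0, 1                      # first access: enter queue 0
        else:
            freq = self.queues[i].pop(page_id)[0] + 1
            if i < M - 1 and freq >= 2 ** (i + 1):
                i += 1                          # promote once counter reaches 2^(i+1)
        self.queues[i][page_id] = [freq, self.clock + LIFETIME]
        self.where[page_id] = i

    def expire(self):
        # Demote descriptors whose expiration time has passed to the next lower queue.
        for i in range(1, M):
            for page_id, (freq, exp) in list(self.queues[i].items()):
                if exp < self.clock:
                    del self.queues[i][page_id]
                    self.queues[i - 1][page_id] = [freq, self.clock + LIFETIME]
                    self.where[page_id] = i - 1
```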

An observation about MQ is that it clusters the pages with similar access locality into the same queue. Moreover, unlike LRU, MQ considers both frequency and recency in page accesses (we study how the locations of pages in the MQ queues correlate with their access patterns in Appendix F of the supplementary file). As a result, we have a simple yet effective approach to place the pages in the ranks. Suppose each rank has a distinct hotness value. We assign pages to ranks in a manner such that, given any two pages p and p′ with descriptors in Queues q and q′, p and p′ are stored in ranks r and r′ (r is hotter than r′) if and only if q > q′, or q = q′ and p is ahead of p′ in the queue. That means the pages whose descriptors are stored in a higher queue in MQ are stored in hotter ranks. Within the same queue in MQ, the more recently accessed pages are stored in hotter ranks. Algorithm 2 shows the process of grouping the pages into R sets, and each set of pages is stored in a memory rank. Each rank has a capacity of C pages.

Algorithm 2 Obtain R page groups in the increasing hotness

1: initiate R empty sets, S_0, S_1, ..., S_{R−1};
2: curSet = 0;
3: for Queue i = M−1, M−2, ..., 0 in MQ do
4:   for Page p from head to tail in Queue i do
5:     Add p to S_curSet;
6:     if |S_curSet| = C then
7:       curSet++;
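A direct Python rendering of Algorithm 2 is sketched below; it assumes the MQ queues are supplied as plain lists ordered from the hottest (most recently used) descriptor to the coldest within each queue, and the function name and the small example are illustrative.

```python
def group_pages_by_hotness(mq_queues, rank_capacity):
    """Group pages into rank-sized sets, hottest group first (Algorithm 2).

    mq_queues: list of M queues; queue M-1 holds the hottest page descriptors,
               and within a queue pages are ordered from most to least recently used.
    rank_capacity: C, the number of pages a rank can hold.
    """
    groups = [[]]
    # Walk queues from the hottest (M-1) down to the coldest (0).
    for queue in reversed(mq_queues):
        for page_id in queue:
            if len(groups[-1]) == rank_capacity:
                groups.append([])        # start the next (colder) group
            groups[-1].append(page_id)
    return groups                        # groups[0] is mapped to the hottest rank

# Example: 4 queues and ranks holding two pages each (the setting of Figure 2).
queues = [["P0", "P1"], ["P2", "P3"], ["P4", "P5"], ["P6", "P7"]]
print(group_pages_by_hotness(queues, 2))
# [['P6', 'P7'], ['P4', 'P5'], ['P2', 'P3'], ['P0', 'P1']]
```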

Figure 2 illustrates an example of page placement onto the ranks. There are four ranks in DRAM, and each rank can hold two pages. At epoch i, we run Algorithm 2 on the MQ structures, and obtain the page placement on the right. For example, P6 and P7 belong to Q3, which are the hottest pages, and they are placed into the hottest rank (here r0). At epoch i+1, there are some changes in the MQ (the underlined page descriptors) and the updated page placement is shown on the right.

Page migrations. To update the page placement at each epoch, we first need to determine the mapping from groups to ranks, i.e., which rank stores which set (or group) of pages determined in Algorithm 2.


Fig. 2. An example of page placement on ranks.

Fig. 3. An example of page migrations: (a) calculate the maximum matching on the bipartite graph; (b) calculate the Eulerian cycle for page migrations.

According to the current page placement among ranks, different mappings from groups to ranks can result in different amounts of page migrations, leading to different amounts of penalty in energy and latency. We should find a mapping that minimizes page migrations.

We formulate this problem as finding a maximum weighted matching on a balanced bipartite graph. The bipartite graph is defined as G whose partition has the parts U and V. Here, U and V are defined as the page placement among ranks in the previous epoch and the page groups obtained with Algorithm 2 in the current epoch, respectively. An edge between r_i and S_j has a weight equal to the number of pages that exist in both rank r_i and S_j. Since |U| = |V|, that is, the two subsets have equal cardinality, G is a balanced bipartite graph. We find the maximum weighted matching of such a balanced bipartite graph with the classic Hopcroft-Karp algorithm. The maximum weighted matching means the maximum number of pages that are common on both sides, and equivalently the matching minimizes the number of page migrations. Figure 3(a) illustrates the calculation of the maximum matching for the bipartite graph for the example in Figure 2. In this example, there are multiple possible matchings with the same maximum matching weight. The thick edges represent one such maximum matching.
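As an illustration of the group-to-rank assignment, the sketch below maximizes the total page overlap with SciPy's linear_sum_assignment (the Hungarian algorithm). This is a substitute solver used only for illustration, not the Hopcroft-Karp routine named above, and the example sets are loosely modeled on Figure 2.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_groups_to_ranks(old_placement, new_groups):
    """Map each new page group to a rank so that the page overlap is maximized,
    i.e., the number of page migrations is minimized.

    old_placement[r]: set of pages currently stored on rank r.
    new_groups[g]:    set of pages of group g produced by Algorithm 2.
    Returns a dict {group index -> rank index}.
    """
    R = len(old_placement)
    # weight[r][g] = number of pages of group g already resident on rank r.
    weight = np.array([[len(old_placement[r] & new_groups[g]) for g in range(R)]
                       for r in range(R)])
    # linear_sum_assignment minimizes total cost, so negate to maximize overlap.
    rank_idx, group_idx = linear_sum_assignment(-weight)
    return {g: r for r, g in zip(rank_idx, group_idx)}

# Illustrative example with four ranks holding two pages each.
old = [{"P6", "P7"}, {"P4", "P5"}, {"P2", "P3"}, {"P0", "P1"}]
new = [{"P2", "P7"}, {"P4", "P3"}, {"P6", "P5"}, {"P0", "P1"}]
print(assign_groups_to_ranks(old, new))  # one maximum-overlap assignment; ties may exist
```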

After the page mappings to individual ranks are determined, we know which pages should be migrated from one rank to another. Then, we need to schedule the page migrations in a manner that minimizes the runtime overhead. Inspired by the Eulerian cycle in graph theory, we develop a novel approach to perform multiple page migrations in parallel. We consider a labeled directed graph G_m where each node represents a distinct rank. An edge from node r_i to node r_j is labeled with a page descriptor, representing the page to be migrated from rank r_i to rank r_j.

Each strongly connected component of G_m has Eulerian cycles. According to graph theory, a directed graph has an Eulerian cycle if and only if every vertex has equal in-degree and out-degree, and all of its vertices with nonzero degree belong to a single strongly connected component. By definition, each strongly connected component of G_m satisfies both properties, and thus we can find Eulerian cycles in G_m. The page migration follows the Eulerian cycle. We divide the Eulerian cycle into multiple segments so that each segment is a simple path or cycle. Then, the page migrations in each segment can be performed concurrently. Figure 3(b) illustrates one example of an Eulerian cycle according to the maximum matching on the left. The three migrations form an Eulerian cycle, and they are performed in one segment.
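Below is a sketch of extracting an Eulerian cycle with Hierholzer's algorithm and cutting it into simple-path/cycle segments. The segmentation heuristic (close a segment whenever a rank repeats) is an assumed interpretation of the scheme rather than the authors' exact implementation, and the example edges are only loosely modeled on Figure 3(b).

```python
from collections import defaultdict

def eulerian_cycle(edges, start):
    """Hierholzer's algorithm for a directed multigraph that is assumed to
    satisfy the Eulerian condition (equal in-degree and out-degree per vertex)."""
    adj = defaultdict(list)
    for src, dst in edges:
        adj[src].append(dst)
    stack, cycle = [start], []
    while stack:
        u = stack[-1]
        if adj[u]:
            stack.append(adj[u].pop())
        else:
            cycle.append(stack.pop())
    return cycle[::-1]  # sequence of ranks; consecutive pairs are migrations

def split_into_segments(cycle):
    """Cut the Eulerian cycle into segments that are simple paths or cycles, so
    migrations inside a segment touch each rank at most once and can run
    concurrently (an assumed segmentation heuristic)."""
    segments, seg, seen = [], [cycle[0]], {cycle[0]}
    for rank in cycle[1:]:
        seg.append(rank)
        if rank in seen:             # revisiting a rank closes the current segment
            segments.append(seg)
            seg, seen = [rank], {rank}
        else:
            seen.add(rank)
    if len(seg) > 1:
        segments.append(seg)
    return segments

# Illustrative example: three pages migrate along the cycle r0 -> r2 -> r1 -> r0.
edges = [("r0", "r2"), ("r2", "r1"), ("r1", "r0")]
print(split_into_segments(eulerian_cycle(edges, "r0")))
# [['r0', 'r2', 'r1', 'r0']]  -- a single segment, as in the paper's example
```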

To facilitate concurrent page migrations according to the Eulerian cycle, each rank is equipped with one extra row-buffer for storing the incoming page. When migrating a page, a rank first writes the outgoing page to the buffer of the target rank, and then reads the incoming page from its own buffer. We provide more implementation details and overhead analysis in Appendix A of the supplementary file.

4.2 Prediction Model

When a new slot starts, we run a prediction model against each rank. The model predicts the idle period distribution. Our estimation should be adapted to the potential changes in the page locality as well as the set of pages in each rank.

We use a histogram to represent the idle period distribution. Suppose the slot size is T cycles, and the histogram has T buckets. We denote the histogram by Hist[i], i = 0, 1, ..., T. The histogram means there are Hist[i] idle periods with a length of i cycles each. One issue is the storage overhead of the histogram. A basic approach is to store the histogram in an array, with each bucket represented as a 32-bit integer. However, the storage overhead of this basic approach is too high. Consider a slot size of 10^8 cycles in our experiments. The basic approach consumes around 400MB per rank. In practice, the histogram is usually very sparse, and there are at most √T idle periods longer than √T cycles. Thus, we develop a simple approach to store the short and the long idle periods separately. In particular, we maintain two small arrays: the histogram counters for the short idle periods no longer than √T cycles, and another array of √T integers to store the actual lengths of the long idle periods that are longer than √T cycles. This simple approach reduces the storage overhead to 2√T integers. It takes only 80KB per rank to support a slot size of 10^8 cycles. We calculate the histogram for idle periods longer than √T cycles with just one scan on the array.

Our estimation specifically considers page migrations. If the new slot is not the beginning of an epoch, there is no page migration and we use the actual histogram of the previous slot, Hist′[i], as the prediction for the current slot, i.e., Hist[i] = Hist′[i] (0 ≤ i ≤ T). Otherwise, we need to combine the access locality of the migrated pages with the historical histogram.

Our estimation after page migration works as follows. We model the references to the same page as conforming to a Poisson distribution. Suppose a page i is accessed f times in a slot. Under the Poisson distribution, the probability of having one access to page i within a cycle is p_i = (g · f)/T, where g is the memory access latency. In our implementation, we take advantage of the frequency counter and the expiration time in the MQ structure (as described in the previous section) to approximate p_i. This already offers a sufficiently accurate approximation in practice. Given a rank consisting of N pages (pages 0, 1, ..., N−1), the probability of an idle cycle in the rank is Q = (1−p_0) · (1−p_1) · ... · (1−p_{N−1}). Based on Q, we can estimate the probability of forming an idle period with a length of k cycles (followed by a busy cycle in the (k+1)-th cycle). That is, the probability of having an idle period of k cycles is W_k = Q^k · (1−Q).

We denote the old values of those probabilities in the previous epoch by W′_k (k = 0, 1, 2, ..., T). After page migrations, we calculate W_k (k = 0, 1, 2, ..., T) according to the updated pages in the rank. Given the actual histogram of the previous slot, Hist′[i], we can estimate the histogram of the new slot with the ratio W_i/W′_i, that is, Hist^+[i] = (W_i/W′_i) · Hist′[i]. Finally, we normalize the histogram so that it represents the total time length of a slot. Denote s′ = Σ_{i=0}^{T} Hist^+[i] · (i + g). We normalize the histogram by the value T/s′, i.e., Hist[i] = Hist^+[i] × (T/s′). We use Hist[i] as the prediction of the idle period distribution for the new slot.
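The sketch below illustrates both the sparse histogram layout and the migration-aware rescaling just described. The function signatures, the dict-based histogram, and the way per-page access probabilities are supplied are assumptions for illustration, not the exact RAMZzz implementation.

```python
import math

def sparse_histogram(idle_periods, slot_cycles):
    """Store short idle periods (<= sqrt(T)) as counters and keep the lengths of
    the few long ones explicitly, as described above."""
    threshold = math.isqrt(slot_cycles)
    short = [0] * (threshold + 1)    # short[i] = number of idle periods of length i
    long_lengths = []                # at most ~sqrt(T) entries in practice
    for length in idle_periods:
        if length <= threshold:
            short[length] += 1
        else:
            long_lengths.append(length)
    return short, long_lengths

def predict_histogram(hist_prev, page_probs_old, page_probs_new, slot_cycles, g):
    """Rescale the previous slot's histogram Hist'[i] by W_i / W'_i after a page
    migration, then normalize so the buckets cover one slot of T cycles.

    hist_prev: dict {idle length i -> count} observed in the previous slot.
    page_probs_old/new: per-page access probabilities p_i before/after migration.
    g: memory access latency in cycles.
    """
    def idle_cycle_prob(page_probs):
        q = 1.0
        for p in page_probs:
            q *= (1.0 - p)           # Q = product of (1 - p_i)
        return q

    q_old, q_new = idle_cycle_prob(page_probs_old), idle_cycle_prob(page_probs_new)
    hist_plus = {}
    for i, count in hist_prev.items():
        w_old = (q_old ** i) * (1.0 - q_old)    # W'_i
        w_new = (q_new ** i) * (1.0 - q_new)    # W_i
        hist_plus[i] = count * (w_new / w_old) if w_old > 0 else 0.0
    # Normalize: the predicted idle periods plus their closing accesses should
    # account for the whole slot of T cycles.
    s = sum(c * (i + g) for i, c in hist_plus.items())
    scale = slot_cycles / s if s > 0 else 0.0
    return {i: c * scale for i, c in hist_plus.items()}
```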

Based on the prediction model, we will estimate the power-down timeouts for the new slot in the next subsection.

4.3 Adaptive Demotions

With the predicted idle period distribution, there are opportunities to avoid state transitions upon those short idle periods, and to have instant state transitions for long idle periods. For example, if we know all the idle periods are expected to be very long, we can set the power-down timeout to be zero, thus performing instant state transitions. Thus, we have developed a simple approach to reduce the total penalty of state transitions. The basic idea is, for each low-power state, we use one power-down timeout to determine the state transition within the entire slot. Suppose a DDR-series memory architecture has M low-power states, denoted as S_1, ..., S_M in descending order of their power consumptions. For each low-power state S_i, RAMZzz performs the state transition to S_i after an idle period threshold ∆_i. If the idle period is shorter than ∆_i, RAMZzz does not make the state transition to S_i.

Since we need to exploit all power states in order to adapt to different workloads and different memory architectures, a naive approach is to consider all possible state transitions. However, the demotion configuration of the naive approach is too complex to derive. Instead of considering all state transitions, we view multiple state transitions as a chain of state transitions from higher-power states to lower-power states. We will show that our adaptive demotion scheme can identify the unnecessary power states in a chain of states, and thus further simplify the demotion scheme. We define the demotion configuration to be a vector of demotion times ~∆ = (∆_1, ..., ∆_M), where ∆_i represents the power-down timeout of low-power state S_i, i = 1, ..., M. In the chain, when the idle period length is longer than ∆_i, we perform the state transition from S_{i−1} to S_i.

Given the estimated histogram of idle periods, we estimate the demotion configuration of each rank for a given optimization goal. We use energy consumption as the optimization goal to illustrate our algorithm design for estimating the demotion configuration. One can similarly extend it to other goals such as ED2. Since the choice of different power-down timeouts does not affect the energy consumption of memory reads and writes, our metric can be simplified to the total energy consumption of background power and the state transition penalty.

We analyze the energy consumption of different demotions over an idle period. Suppose the idle period length is t cycles, and the power consumptions of the active state ACT and a low-power state S_i are P_ACT and P_{S_i} (i = 1, ..., M), respectively. Given a demotion configuration ~∆, if t ≤ ∆_1, there is no state transition to low-power states. Otherwise, denote I(t) to be the maximum i such that ∆_i < t (i = 1, ..., M). In the chain, there are at most I(t) state transitions, from S_1 to S_{I(t)}. At the end of the idle period, a memory access comes and the rank transits from low-power state S_{I(t)} back to ACT. Thus, the energy consumption of the idle period can be calculated as B(~∆, t) in Eq. (1).

B(~∆, t) = P_ACT · ∆_1 + Σ_{j=1}^{I(t)−1} P_{S_j} · (∆_{j+1} − ∆_j) + P_{S_{I(t)}} · (t − ∆_{I(t)}) + E_{S_{I(t)}}    (1)

where E_{S_{I(t)}} is the resynchronization energy penalty from low-power state S_{I(t)} back to ACT.

Given the histogram Hist[t] (t = 0, 1, ..., T), each Hist[t] means there are Hist[t] idle periods with length t cycles. We can calculate the total energy consumption for all the idle periods, as E(~∆) in Eq. (2).

E(~∆) = Σ_{t=0}^{∆_1} P_ACT · t · Hist[t] + Σ_{t=∆_1+1}^{T} B(~∆, t) · Hist[t]    (2)

RAMZzz also allows users to specify a delay budget to limit the delay penalty incurred by state resynchronization. We calculate the total resynchronization delay as D(~∆) in Eq. (3).

D(~∆) = Σ_{t=∆_1+1}^{T} R_{S_{I(t)}} · Hist[t]    (3)

where R_{S_{I(t)}} is the resynchronization delay from low-power state S_{I(t)} back to ACT.
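The following is a direct transcription of Eqs. (1)–(3) into code, assuming the histogram is given as a dict of {idle length in cycles: count} and the low-power states S_1..S_M are described by per-cycle power, resynchronization energy and resynchronization delay arrays; the function names and argument layout are illustrative.

```python
def transition_level(deltas, t):
    """I(t): index of the deepest state whose timeout has elapsed (1-based),
    or 0 if t <= Delta_1 and the rank never leaves ACT.
    deltas is assumed non-decreasing (Delta_1 <= ... <= Delta_M)."""
    level = 0
    for i, d in enumerate(deltas, start=1):
        if d < t:
            level = i
    return level

def idle_energy(deltas, t, p_act, p_low, e_resync):
    """B(Delta, t) of Eq. (1): background energy of one idle period of t cycles.
    p_low[i-1] and e_resync[i-1] describe low-power state S_i."""
    level = transition_level(deltas, t)
    if level == 0:
        return p_act * t
    energy = p_act * deltas[0]
    for j in range(1, level):                    # S_j held from Delta_j to Delta_{j+1}
        energy += p_low[j - 1] * (deltas[j] - deltas[j - 1])
    energy += p_low[level - 1] * (t - deltas[level - 1])
    return energy + e_resync[level - 1]

def slot_energy(deltas, hist, p_act, p_low, e_resync):
    """E(Delta) of Eq. (2): total background plus transition energy over a slot."""
    return sum(count * idle_energy(deltas, t, p_act, p_low, e_resync)
               for t, count in hist.items())

def slot_delay(deltas, hist, r_resync):
    """D(Delta) of Eq. (3): total resynchronization delay over a slot."""
    total = 0.0
    for t, count in hist.items():
        level = transition_level(deltas, t)
        if level > 0:
            total += count * r_resync[level - 1]
    return total
```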


Algorithm 3 The greedy algorithm to find the suitable demotion configuration ~∆

Input: the set of all low-power states ~S = (S_1, ..., S_M), and the associated power consumptions ~P = (P_{S_1}, ..., P_{S_M});
Initialization: ~∆ = φ, ~S_select = φ;

1: while |~S_select| ≠ M do
2:   for all S_i ∈ ~S do
3:     Add S_i into ~S_select;
4:     for each possible ∆_i value do
5:       Calculate E(~∆) using Eq. (2) with the selected low-power state subset ~S_select;
6:     Find the suitable ∆_i that has the best E(~∆);
7:     Remove S_i from ~S_select;
8:   Find the low-power state S_k that has the best E(~∆);
9:   Add ∆_k into ~∆;
10:  Add S_k into ~S_select and remove S_k from ~S;

Output: the power-down timeout set ~∆

Our goal is to determine the suitable demotion configuration ~∆ so that E(~∆) is minimized. If a delay budget is given, we choose the ~∆ value that minimizes E(~∆) under the constraint that the total delay D(~∆) is no larger than the given delay budget.

We note that E(~∆) is neither concave nor monotonic. Therefore, we would have to iterate over all possible values ∆_i = 0, 1, ..., T (i = 1, ..., M) and find the best combination of ∆_i (i = 1, ..., M). The complexity of this naive approach increases exponentially with the number of low-power states in the DRAM architecture. In the following, we develop an efficient greedy algorithm to find a reasonably good demotion configuration (illustrated in Algorithm 3).

We start by assuming that only one low-power state is used in the entire slot, and select the most suitable low-power state and its power-down timeout, which leads to the smallest estimated E(~∆) among all M low-power states. Then, we keep the estimated demotion time of the selected low-power state unchanged, and select a new low-power state and its power-down timeout from the remaining M−1 low-power states, which results in the smallest estimated E(~∆). We repeat this process to add one more new low-power state, together with its power-down timeout, into the previously selected subset of low-power states in each step. Algorithm 3 has a much lower computational complexity than the naive approach.
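A sketch of this greedy construction is given below. It assumes evaluator callbacks with the shape of slot_energy and slot_delay from the previous sketch (taking a partial configuration of selected states), and the list of candidate timeout values is supplied by the caller; the ordering check reflects the ∆_{i−1} ≤ ∆_i ≤ ∆_{i+1} constraint discussed in the next paragraph.

```python
def greedy_demotion_config(num_states, candidates, energy_fn, delay_fn=None, budget=None):
    """Greedy selection of power-down timeouts, following Algorithm 3.

    num_states: M, the number of low-power states S_1..S_M (1-based indices).
    candidates: list of candidate timeout values (e.g., non-zero histogram buckets).
    energy_fn:  callable taking {state index -> timeout} for the states selected
                so far and returning the estimated E(Delta).
    delay_fn:   optional callable with the same argument returning D(Delta).
    budget:     optional upper bound on the total resynchronization delay.
    Returns a dict {state index -> timeout}.
    """
    config = {}                                  # timeouts of states selected so far
    remaining = set(range(1, num_states + 1))
    while remaining:
        best = None                              # (energy, state, timeout)
        for s in remaining:
            for timeout in candidates:
                # Keep the chain ordering: a deeper state (larger index) must not get
                # a smaller timeout than an already selected shallower state.
                if any((s2 < s and t2 > timeout) or (s2 > s and t2 < timeout)
                       for s2, t2 in config.items()):
                    continue
                trial = dict(config)
                trial[s] = timeout
                if budget is not None and delay_fn is not None and delay_fn(trial) > budget:
                    continue
                e = energy_fn(trial)
                if best is None or e < best[0]:
                    best = (e, s, timeout)
        if best is None:                         # no feasible addition under the budget
            break
        _, s, timeout = best
        config[s] = timeout
        remaining.remove(s)
    return config
```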

Algorithm 3 has a low runtime overhead in most cases. First, it does not need to iterate through all values from 0 to T (T is the slot size). Instead, it only searches those values with non-zero frequencies in the predicted histogram. This number is far smaller than T in practice. Second, as more low-power states are selected during the process (one state per step), the search space for the remaining low-power states is further reduced, since the power-down timeout of S_i is bounded by those of S_{i−1} and S_{i+1}, i.e., ∆_{i−1} ≤ ∆_i ≤ ∆_{i+1}. Moreover, we further optimize Algorithm 3 in two ways. First, we adopt a branch-and-bound optimization in order to further reduce the search space (that is, we try possible values from the highest to the lowest until the program performance penalty violates the given budget). Second, we use an exponential search approach by iterating over values of the form 2^i (0 ≤ i ≤ log_2 T) for each power-down timeout. On current architectures, the greedy algorithm has a low runtime overhead and provides near-optimal demotion configurations, as shall be shown in our evaluation (Section 5).

The adaptive demotion scheme is applied to each rank at the beginning of a slot. The demotion configurations can be different among different ranks and at different slots. This is a distinct feature of adaptive demotion, in comparison with the previous work on static demotion schemes [5], [6], [7], [9].

4.4 Other Implementation Issues

RAMZzz can be implemented with a combination of modest hardware and software support. First, RAMZzz adds a few new components to the memory controller and the operating system. Following the previous study [38], RAMZzz extends a programmable controller [39] by adding its own new components. Four new modules, including MQ, Migration, Remap and Demotion, are added into the memory controller for implementing the functionality of page grouping, page migration, page remapping and power state control in RAMZzz, respectively. Other functionalities including page grouping and the prediction model are offloaded to the OS (like previous studies [5], [38]). Second, we note that the structural complexity and storage overhead of RAMZzz are similar to the previous proposals, e.g., [6], [9], [10], [29], [40]. For example, our design has a small DRAM space requirement (less than 2% of the total amount of DRAM). We have included both the performance and energy penalties of these new modules in our simulation and evaluation, and demonstrated that their overheads are acceptable on current architectures in Section 5.

In Appendix A of the supplementary file, we provide more discussions on implementation details, including energy/performance/storage overhead analysis and optimizations of RAMZzz.

5 EVALUATION

In this section, we evaluate our design using ED2 and energy consumption as metrics. We have conducted a number of experiments. We compare the behavior of RAMZzz and its alternatives in order to show the effectiveness of RAMZzz on different memory architectures, and the impact of individual techniques. We focus on the DRAM component. In the interest of space, we present the results with ED2 as the optimization goal in this paper, and leave the results with the total energy consumption as the optimization metric to Appendix C of the supplementary file. We also study the impact of RAMZzz on full system energy savings in Appendix D of the supplementary file.


5.1 Methodology

Our evaluation is based on trace-driven simulations. In the first step, we use cycle-accurate simulators to collect memory access traces (last-level cache misses and writebacks) from running benchmark workloads. In the second step, we replay the traces using our detailed memory system simulator. Our simulation models all the relevant aspects of the OS, memory controller, and memory devices, including page replacements, memory channel and bank contention, memory device power and timing, and row buffer management. The memory controller exploits the page interleaving mechanism. More implementation details can be found in Appendix A of the supplementary file.

We evaluate workloads from SPEC 2006 and PARSEC [11]. We use two different approaches to collect their memory traces: one with the PTLSim [41] simulator and the other with the Sniper [42] simulator. On the one hand, the memory footprints of SPEC 2006 are usually smaller than those of PARSEC. On the other hand, PARSEC cannot run on PTLSim, which we use to collect the memory traces from SPEC 2006. Also, PTLSim can offer more control on the hardware configurations for sensitivity studies.

SPEC 2006 Workloads. We use PTLSim [41] to collect memory access traces of SPEC 2006 workloads. The main architectural characteristics of the simulated machine are listed in Table 2. We model and conduct the evaluation with an in-order processor following previous studies [29], [38]. More complex and recent processors are studied with Sniper-based simulations. We evaluate our techniques with three different memory architectures, as shown in Table 1 (Section 2.1). These memory architectures are used in different computing systems. We simulate different capacities (1GB, 2GB and 4GB) and different numbers of ranks (4, 8, 12 and 16) for the memory system. All the ranks have the same configurations (DRAM parameters) and capacities. By default, we assume a 2GB DRAM with 8 ranks. We pick these small memory sizes to match the footprint of the workloads' simulation points. We calculate the memory power consumption following Micron's System Power Calculator [18], with the power and delay values illustrated in Table 1. The energy and performance overheads caused by the new MC and OS modules (e.g., remapping, migration and demotion) are derived from our analysis in Appendix A of the supplementary file, and are consistent with those of other authors [38], [40], [43].

We have used 19 applications from SPEC 2006 with the ref inputs. These workloads have widely different memory access rates, footprints and localities. Due to space limitations, we do not present the results for single applications; instead, we report their geometric mean (GM), and also four particular applications with different memory intensiveness. They are omnetpp, cactusADM, mcf and lbm (denoted as S1, S2, S3 and S4, respectively).

TABLE 2
Architectural configurations of PTLSim. The default setting is highlighted.

Component                  Features
CPU                        In-order cores running at 2.66GHz
Cores                      4
TLB                        64 entries
L1 I/D cache (per core)    48KB
L2/L3 cache (shared)       256KB/4MB
Cache line/OS page size    64B/4KB
DRAM                       DDR3-1333, DDR2-800, LPDDR2-800
Channels                   4
Ranks                      4, 8, 12, 16
Capacity (GB)              1, 2, 4
Delay and Power            see Table 1

TABLE 3
Mixed workloads: memory footprint (FP) and memory access statistics per 5×10^8 cycles (Mean and Stdev/Mean).

Name   FP (MB)   Mean (10^6)   Stdev/Mean   Applications
M1     661.3     0.6           1.02         gromacs, gobmk, hmmer, bzip
M2     1477.4    1.7           1.11         bzip, soplex, sjeng, cactusADM
M3     626.6     2.9           0.59         soplex, sjeng, gcc, zeusmp
M4     537.8     3.5           0.47         zeusmp, gcc, leslie3d, omnetpp
M5     1082.9    4.4           0.71         gcc, leslie3d, calculix, gemsFDTD
M6     988.2     7.8           0.40         libquantum, milc, mcf, lbm

To assess our algorithm in the context of multi-core CPUs, we study mixed workloads of four different applications from SPEC 2006 (Table 3). The four applications start at the same time. The mixed workloads form multi-programmed executions on a four-core CPU, ordered by the average number of memory accesses (Mean). The standard deviation and mean values are calculated based on memory access statistics per 5×10^8 CPU cycles. For each workload, we select a simulation period of 15×10^9 cycles in the original PTLSim simulation, which represents a stable and sufficiently long execution behavior.

PARSEC Workloads. Since the current PTLSim cannot support PARSEC benchmarks, we use another simulator, Sniper [42], to collect memory access traces of PARSEC. We also note that some workloads in PARSEC, like dedup, facesim and canneal, cannot run successfully. In this study, we focus on the results for four applications: blackscholes, bodytrack, ferret and streamcluster. By default, we use the simulated CPU architecture shown in Table 4 (Intel's Gainestown CPUs), which simulates a four-core processor running at 2.66 GHz based on Intel's Nehalem microarchitecture. By default, we simulate a four-core CPU and a 2GB DRAM with 8 ranks. The memory architecture has the same power consumption and performance configurations as the PTLSim-based simulations. Also, we conduct simulation studies with main-stream servers with a large number of cores and large memory capacity in Section 5.5. Each PARSEC workload runs with four threads, and each thread is assigned to one core. We use the sim-medium inputs for the PARSEC workloads, and perform the measurement on the specified Region-of-Interest (ROI) of the PARSEC workloads [11].

Comparisons. In our previous study [36], we have already shown that the preliminary version of RAMZzz significantly outperforms other power management techniques [6], [9], [12] in terms of both ED2 and energy consumption.


TABLE 4
Architectural configurations of Sniper. The default setting is highlighted.

Component                  Features
CPU                        Out-of-order cores running at 2.66GHz
Cores                      4, 8, 16
DTLB/ITLB                  64/128 entries
L1 I/D cache (per core)    32KB/32KB
L2 cache (per core)        256KB
L3 cache (shared)          8MB
Cache line/OS page size    64B/4KB
DRAM                       DDR3-1333
Channels                   4
Ranks                      8, 16
Capacity (GB)              2, 32, 64
Delay and Power            see Table 1

Due to the space limitation, we focus on evaluating the impact of the individual techniques developed in this paper. In particular, we consider two RAMZzz variants, namely RZ–SP and RZ–SD. They are the same as RAMZzz except that RZ–SP uses the static page management scheme without page migrations, whereas RZ–SD uses the static demotion scheme. The static demotion scheme simply transits a rank to a pre-selected low-power state according to the prediction model.

In addition to RAMZzz variants, we also simulate the following techniques for comparison. All the metrics reported in this paper are normalized to those of BASE.

• No Power Management (BASE): no power management technique is used, and ranks are kept active even when they are idle.

• Ideal Oracle Approach (ORACLE): ORACLE is the same as RAMZzz, except the power-down timeout in ORACLE is determined with the future information, instead of history. Specifically, at the beginning of each slot, we perform an offline profiling on the current slot, and get the real histogram of idle periods. Based on the histogram, we calculate the optimal power-down timeout.
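As an illustration of this timeout selection (a simplified sketch rather than RAMZzz's full cost model; the power and resynchronization numbers below are hypothetical placeholders, not the Table 1 values), a timeout can be chosen from an idle-period histogram by estimating, for each candidate timeout, the background energy of all idle periods and keeping the minimum:

# Sketch: pick a power-down timeout from an idle-period histogram.
# All power/delay numbers are hypothetical placeholders, not Table 1 values.

ACT_POWER = 1.0          # normalized background power while staying active
LP_POWER = 0.2           # normalized power in the chosen low-power state
RESYNC_CYCLES = 1000     # resynchronization delay paid when the rank wakes up

def energy_for_timeout(histogram, timeout):
    """Estimate background energy if every idle period longer than
    `timeout` is demoted after `timeout` cycles. `histogram` maps
    idle-period length (cycles) -> number of such periods."""
    energy = 0.0
    for length, count in histogram.items():
        if length <= timeout:
            # Too short: the rank stays active for the whole period.
            energy += count * ACT_POWER * length
        else:
            # Active until the timeout, low power afterwards,
            # plus the active-power cost of resynchronization.
            energy += count * (ACT_POWER * timeout
                               + LP_POWER * (length - timeout)
                               + ACT_POWER * RESYNC_CYCLES)
    return energy

def best_timeout(histogram):
    """Search candidate timeouts (the distinct idle lengths, plus 0 for
    immediate demotion) for the one minimizing the estimated energy."""
    candidates = sorted(histogram) + [0]
    return min(candidates, key=lambda t: energy_for_timeout(histogram, t))

if __name__ == "__main__":
    hist = {500: 900, 2500: 300, 40000: 120, 200000: 30}
    print(best_timeout(hist))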

RAMZzz allows users to specify the slot and epoch sizes and delay budgets. By default, the slot size is 10^8 cycles, an epoch consists of ten slots (10^9 cycles), and the delay budget is set to 4% of the slot size. We evaluate the impacts of these parameters in Appendix E of the supplementary file.

Idle Period Distribution. We study the distribution of idle periods. Figure 4 shows the histogram of idle period lengths of the collected traces on Rank 0 on DDR3 under the BASE approach. Many idle periods are too short to be exploited for state transitions, e.g., shorter than the threshold idle period length for demoting to SR FAST (2500 cycles on DDR3). We observed similar results on other ranks.

5.2 Results on SPEC 2006 Workloads

We first compare the algorithms with the optimization goal of ED2 on SPEC 2006 workloads, because ED2 is a widely used metric for energy efficiency.

Fig. 4. The histogram of idle periods with M2 on Rank 0.

We study the overall impact of RAMZzz in comparison with BASE and ORACLE. The comparison with BASE shows the overall effectiveness of the energy saving techniques of RAMZzz, and the comparison with ORACLE shows the effectiveness of our prediction model. Figure 5 presents normalized ED2 results for the RAMZzz and ORACLE approaches on three different DRAM architectures (more randomly-mixed workloads chosen from SPEC 2006 are evaluated in Appendix B of the supplementary file). If the normalized ED2 of an approach is smaller than 1.0, the approach is more energy efficient than BASE.
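For clarity, normalized ED2 is simply the energy-delay-squared product of a policy divided by that of BASE; a minimal sketch with purely illustrative numbers:

# Sketch: normalized ED2 relative to BASE (numbers are illustrative only).
def ed2(energy_joules, delay_seconds):
    return energy_joules * delay_seconds ** 2

base = ed2(energy_joules=100.0, delay_seconds=10.0)
policy = ed2(energy_joules=40.0, delay_seconds=10.3)   # saves energy, small slowdown
print(policy / base)   # < 1.0 means more energy-efficient than BASE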

Thanks to the rank-aware power management, RAMZzz is significantly more energy-efficient than BASE. Compared with BASE, the reduction on ED2 is 64.2%, 63.3% and 63.0% on average on DDR3, DDR2 and LPDDR2, respectively. The reduction is more significant on the workloads of single applications (e.g., S1–S4) than the mixed workloads. There are two main reasons. First, since a single-application workload has a smaller memory footprint, the page migration has a smaller overhead and the number of cold ranks is larger. The number of page migrations becomes very small after the first few epochs. In contrast, the execution of the workloads with a large memory footprint (such as M5 and M6) consistently has a fair amount of page migrations at all epochs. Second, on single-application workloads, there are more opportunities for saving background power using lower-power states (such as SR FAST and SR SLOW in DDR3, and SR in DDR2 and LPDDR2). Figure 6 shows the breakdown of time stayed in different power states for RAMZzz on DDR3, DDR2 and LPDDR2. In Figure 6, each power state represents the percentage of time when ranks are in this state during the total simulation period, and Others represents the percentage of time that includes DRAM operations, page remapping delay, page migration delay and resynchronization delay. As the workload becomes more memory-intensive, the portion of time that a rank is in lower-power states becomes less significant, indicating that many idle periods are too short and are not worthwhile for state transitions into lower-power states (even with page migration). For the less memory-intensive workloads like S1–S4 and M1, lower-power states have very significant portions in the total simulation time, indicating significant energy saving compared with BASE.


Fig. 5. Comparing ED2 of RAMZzz and ORACLE with the optimization goal of ED2 on three memory architectures: (a) DDR3, (b) DDR2, (c) LPDDR2.

Fig. 6. The breakdown of time stayed in different power states for RAMZzz with the optimization goal of ED2: (a) DDR3, (b) DDR2, (c) LPDDR2.


It can also be seen from Figure 5 that RAMZzz achieves a very close ED2 to ORACLE on all workloads and memory architectures. RAMZzz achieves 5.7%, 4.4% and 3.7% on average larger ED2 than ORACLE on DDR3, DDR2 and LPDDR2, respectively. This good result is because our histogram-based prediction model is able to accurately estimate the suitable power-down timeout for the sake of minimizing ED2. Figure 7 compares RAMZzz's estimated power-down timeouts to SR FAST with ORACLE on ranks 0 and 2 of executing M4 on DDR3. Our estimation is very close to the optimal value on the two ranks. We observe similar results for different ranks and different workloads and also for the power-down timeouts of other low-power states and other DRAM architectures. We also find that our model has high accuracy in predicting rank idle period distribution (detailed results are presented in Appendix G of the supplementary file).

We have further made the following observations on the result of the breakdown in Figure 6. First, on a specific memory architecture, the portion of time for different low-power states varies significantly across different workloads. Different workloads have different choices on the most energy-effective low-power state. For most single-application workloads, RAMZzz makes the decision to demote into SR SLOW on DDR3 in most idle periods, whereas the decision of demotion is to SR FAST or PRE PDN SLOW for the mixed workloads. Second, on different DRAM architectures, the portion of time for different low-power states varies significantly, even for the same workload. SR on LPDDR2 has a much higher significance in all workloads than on DDR3 and DDR2. That is because, as we have seen in Table 1, SR on LPDDR2 consumes similar normalized power but has a relatively smaller resynchronization time when compared with the other two DRAM architectures.

These two observations have actually demonstrated the effectiveness of adaptive demotions of RAMZzz for different workloads and different memory architectures. We will experimentally study the impact of adaptive demotions in Section 5.3.2.

Figures 8 and 9 show the breakdown of time stayed in different power states for BASE and ORACLE on DDR3, respectively. Compared with Figure 6(a), RAMZzz has a very similar power state distribution to ORACLE on all workloads, which again demonstrates the effectiveness of our estimation. Compared to BASE, both RAMZzz and ORACLE significantly reduce the percentage of time when ranks are in the ACT state by the adaptive use of all available low-power states. We observe similar results for other workloads and DRAM architectures.

Next, we study the performance delay in detail. Figure 10 shows the breakdown of performance delay for RAMZzz on DDR3. We divide the delay into three parts: resynchronization delay (caused by state transitions), migration delay (caused by page migrations) and remapping delay (caused by Remapping Table lookup and address remapping). The performance delay of RAMZzz is well controlled under the pre-defined penalty budget (i.e., 4% in this experiment). The results demonstrate that our model is able to limit the performance delay within the pre-defined threshold. The resynchronization delay contributes the largest portion of performance delay on most workloads. Due to concurrent migrations, the migration delay is kept within an acceptable range (i.e., less than 1.5% on all workloads). The remapping delay only accounts for a very small portion of the performance delay on all workloads. This observation is consistent with our analysis in Appendix A of the supplementary file. The remapping operation is performed when a request is added to the MC queues and does not extend the critical path in the common case because queuing delays at the MC are substantial.


Fig. 7. Power-down timeout comparison between RAMZzz and OPT on Ranks 0 and 2 (y-axis: demotion time in cycles).

Fig. 8. The breakdown of time for BASE.

Fig. 9. The breakdown of time for ORACLE.

Fig. 10. The breakdown of delay for RAMZzz (resynchronization, migration and remapping delays against the performance delay budget).

Fig. 11. The optimization of page migration delay (with and without concurrent migrations).

Fig. 12. Comparing ED2 of RAMZzz and RZ–SP.

Fig. 13. The breakdown of time for RZ–SP.

Fig. 14. Comparing ED2 of RAMZzz and RZ–SD (one bar per pre-selected low-power state).

As seen from Figure 10, the migration delay is higher on the workloads with a large memory footprint (such as S3, M5 and M6). To further study the migration delay, Figure 11 presents the total migration delay of RAMZzz with/without our graph-based optimizations on DDR3. Thanks to our graph-based optimizations (as described in Section 4.1), the total migration delay is significantly decreased, with the reduction of 50.0% to 74.4%. Concurrent migrations prevent significant performance degradation in all workloads.

Finally, we discuss the overhead of calculating the migration information (Eulerian cycle) and the demotion configuration. We find that the number of values with non-zero frequencies in the predicted histogram is far smaller than the slot size (10^8) in our evaluation. Thus, the search space of Algorithm 3 is acceptable at runtime. The average time for calculating the demotion configuration is around several milliseconds on current architectures. Such calculation is performed only once per slot (around 40 ms by default). Moreover, the calculation of the migration information is performed only once per epoch (around 400 ms by default). Thus, their overheads are low on current architectures. The results are consistent with previous studies [20], [29], [38], [40].
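These wall-clock figures follow directly from the default slot/epoch sizes and the simulated 2.66GHz clock; a quick sanity check:

# Sketch: wall-clock length of a slot and an epoch at the simulated clock rate.
CPU_HZ = 2.66e9          # simulated core frequency
SLOT_CYCLES = 1e8        # default slot size
EPOCH_CYCLES = 10 * SLOT_CYCLES

print(SLOT_CYCLES / CPU_HZ * 1e3)    # ~37.6 ms per slot (about 40 ms)
print(EPOCH_CYCLES / CPU_HZ * 1e3)   # ~376 ms per epoch (about 400 ms)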

5.3 Individual Impacts

We now study the individual impact of dynamic migrations and adaptive demotions on RAMZzz with the optimization goal of ED2 on SPEC 2006 workloads. Due to the space limitation, we present the figures for the DDR3 memory architecture only and comment on other architectures without figures when appropriate.

5.3.1 Studies on Dynamic Migrations

We study the impact of dynamic migrations, comparing RAMZzz and RZ–SP (RZ–SP uses the adaptive demotion scheme with no page migration). Figure 12 presents ED2 results for RAMZzz and RZ–SP on DDR3.

RAMZzz has much lower ED2 than RZ–SP, with an average reduction of 21.8%. The reduction depends on the memory footprint and memory access intensiveness. The reduction is more significant on memory-intensive workloads (such as M4) or workloads with a small memory footprint (such as S3 and S4). If a workload has a small memory footprint, the page migration has a small overhead on both the delay and the energy consumption, and the portion of cold ranks is higher. If the memory access of a workload is more intensive, many idle periods are too short and RZ–SP has fewer opportunities for saving background power (even with our proposed adaptive demotion scheme). On those two kinds of workloads, page migration is important for the effectiveness of power management. In contrast, when the memory access is less intensive or the workload has a large memory footprint, RZ–SP is quite competitive with RAMZzz. Example workloads include S1 and M2.

Figure 13 shows the breakdown of time stayed in different power states for RZ–SP on DDR3. Compared with Figure 6(a), the portions of time in those lower-power states with a higher resynchronization time are much smaller. RZ–SP demotes the ranks into the PRE PDN FAST and PRE PDN SLOW states most of the time, whereas RAMZzz demotes into even lower-power states, i.e., SR FAST and SR SLOW. That is because dynamic page migration is able to create longer idle periods. For example, the percentage of SR SLOW is almost zero in RZ–SP for memory intensive workloads, such as S3, S4 and M2–M4, while SR SLOW has a significant portion in RAMZzz for those workloads. Since page migrations are disabled in RZ–SP, the total delay of RZ–SP is slightly smaller than that of RAMZzz, less than 2.5% for all workloads. We show that RAMZzz is still more energy-efficient than RZ–SP on full system ED2 (or energy consumption) in Appendix D of the supplementary file.

To summarize the impact of dynamic page migrations, we observe that RAMZzz has much lower ED2 than RZ–SP on three DRAM architectures, with an average reduction of 21.8%, 15.8% and 17.1%, and a range of 5.2–45.1%, 2.8–39.6% and 1.7–41.1% on DDR3, DDR2 and LPDDR2, respectively.


TABLE 5
Comparing ED2 of RAMZzz with different numbers of low-power states on three DRAM architectures on M1.

Number of low-power states   1      2      3      4      5
DDR3                         0.67   0.59   0.40   0.31   0.27
DDR2                         0.68   0.43   0.35   0.30   N/A
LPDDR2                       0.59   0.41   0.33   N/A    N/A

The reduction is more significant on memory-intensive workloads or workloads with a small memory footprint on DDR2 and LPDDR2.

5.3.2 Studies on Adaptive Demotions

In this section, we study the impact of adaptive demotions; that is, we compare the performance of RAMZzz and RZ–SD (RZ–SD uses the dynamic page migration without the adaptive demotion).

Figure 14 presents the comparison of ED2 for RAMZzz and RZ–SD on DDR3. We compare the performance of RAMZzz with every possible RZ–SD approach on all workloads. That is, we use every available low-power state as the pre-selected low-power state in the RZ–SD approach. Since DDR3 has five low-power states, we have five RZ–SD approaches, where each approach is denoted in Figure 14 by the name of its pre-selected low-power state (e.g., SR FAST represents the RZ–SD approach which uses SR FAST as the pre-selected low-power state).

We observe that RAMZzz outperforms all RZ–SD approaches on all workloads, with the reduction ranging from 26.4% to 51.1% (36.4% on average). Moreover, different workloads have different choices on the most energy-efficient RZ–SD approach, indicating that the static demotion scheme cannot adapt to different workloads. The efficiency of the static demotion scheme is closely related to the decision on the pre-selected low-power state, justifying the necessity of adaptive demotions. The total delay of RZ–SD is close to that of RAMZzz, less than 3% for all workloads. We observe that RAMZzz is also more efficient than RZ–SD on full system ED2 (or energy consumption) in Appendix D of the supplementary file.

Finally, we study the impact of the number of available low-power states. In Table 5, we change the number of available low-power states used on DDR3, DDR2 and LPDDR2 on M1 from 1 to 5, 1 to 4 and 1 to 3, respectively. We add a low-power state with smaller power consumption when increasing the number of available low-power states. As the number of available low-power states increases, the normalized ED2 becomes smaller. The improvement in normalized ED2 by increasing the number of available low-power states from 1 to the maximum is 59.8%, 54.1% and 45.2% on DDR3, DDR2 and LPDDR2, respectively. This further demonstrates the self-adapting feature brought by our proposed adaptive demotion scheme.

To summarize the impact of adaptive demotions, we observe that RAMZzz has much lower ED2 than RZ–SD on three DRAM architectures, with reductions of 26.4–51.1% (36.4% on average), 12.0–48.7% (25.0% on average) and 5.0–41.9% (22.4% on average) on DDR3, DDR2 and LPDDR2, respectively.

5.4 Results on PARSEC Workloads

Figure 15 shows the normalized ED2 results of the RAMZzz and ORACLE approaches on the DDR3 architecture using PARSEC workloads. We use the default experimental setting (e.g., the delay budget is 4%). RAMZzz is also significantly more energy-efficient than BASE on PARSEC workloads. We observe similar results to those on the SPEC 2006 workloads. For example, the reduction is more significant for the workloads with less intensive memory accesses (such as blackscholes). RAMZzz achieves a very close ED2 to ORACLE on PARSEC workloads (as shown in Figure 15).

5.5 Results on More Powerful Computer Systems

We also perform simulation studies on more powerful computer systems with a larger number of CPU cores and large memory capacity. We use Sniper to collect memory access traces of multi-threaded/multi-programmed workloads from the SPEC 2006 and PARSEC benchmarks. With the default simulated CPU architecture, we run mixed workloads of 8 (and 16) applications from SPEC 2006 and four PARSEC workloads (i.e., blackscholes, bodytrack, ferret and streamcluster) executed with 8 (and 16) threads on an 8-core (and 16-core) processor with a 32 (and 64) GB DDR3 memory system with 8 (and 16) ranks.

Figures 16 and 17 present the comparison of normalized ED2 of BASE, ORACLE, RAMZzz, RZ–SP and RZ–SD (SR FAST is used as the pre-selected low-power state) with the optimization goal of ED2 and with a delay budget of 4%. We make the following observations. First, RAMZzz has very significant ED2 reduction compared with BASE. The reduction in ED2 is 60.7% and 67.2% on average on the 8-core and 16-core systems, respectively. Second, RAMZzz achieves only 6.0% higher ED2 on average than ORACLE on all workloads and systems. For the impact of dynamic page migrations, RAMZzz has an average ED2 reduction of 45.4% and 61.2% over RZ–SP on the 8-core and 16-core systems, respectively. For adaptive demotions, RAMZzz has an average ED2 reduction of 40.5% and 49.2% over RZ–SD on the 8-core and 16-core systems, respectively.

6 CONCLUSION

In this paper, we have proposed a novel memory design RAMZzz to reduce the DRAM energy consumption. It embraces two rank-aware power saving techniques to address the major obstacles in state transition-based power saving approaches: dynamic page migrations and adaptive demotions. A cost model is developed to guide the optimizations for different workloads and different memory architectures.


Fig. 15. The overall results of PARSEC workloads.

Fig. 16. Results on an 8-core system with 32GB DDR3.

Fig. 17. Results on a 16-core system with 64GB DDR3.

We evaluate RAMZzz with SPEC 2006 and PARSEC benchmarks in comparison with other power saving techniques on three main memory architectures including DDR3, DDR2 and LPDDR2. Our simulation results demonstrate significant improvement in ED2 and energy consumption over other power saving techniques. Moreover, RAMZzz performs very close to the ideal oracle approach for different workloads and memory architectures.

APPENDIX A
DETAILED IMPLEMENTATION ISSUES

Fig. 18. Memory controller and operating system with RAMZzz's new modules highlighted.

RAMZzz adds a few new components to the memory controller and operating system. Following the previous study [38], RAMZzz extends a programmable controller [39] by adding its own new components (shaded in Figure 18). Other functionalities including page grouping and the prediction model are offloaded to the operating system (like previous studies [5], [38]).

Memory Controller Structure. The memory controller (MC) receives read/write requests from the cache controller via the CMD FIFOs. The Arbiter dequeues requests from those FIFO queues, and the controller converts those requests into the necessary instructions and sequences required to communicate with the memory. The Datapath module handles the flow of reads and writes between the memory devices. The physical interface converts the controller instructions into the actual timing relationships and signals required for accessing the memory device.

We assume the MC exploits page interleaving. Page interleaving exploits higher data locality but also makes accesses to multiple banks less uniform, which may cause row-buffer conflicts in some cases. However, it is better for grouping and migrating physical pages based on the frequency and recency of accesses, as described in Section 4 (the hot ranks are more likely to have very short idle periods, and the cold ranks have relatively long idle periods). A further optimization is to use the permutation-based page interleaving scheme [44], which retains high data locality and reduces row-buffer conflicts. Other interleaving schemes, such as cache line interleaving (which maps consecutive cache lines to different memory banks), may not effectively exploit the data locality. In such a situation, a possible approach is to make migration decisions based on other locality indicators (e.g., row buffer misses as exploited by Wang et al. [45]) rather than LLC misses/writebacks. We leave the optimization and evaluation of other interleaving schemes as future work. Memory requests are handled on a FCFS basis. There are more sophisticated access scheduling optimizations like finite queue length, critical-word-first optimization, and prioritizing reads. Those techniques are orthogonal to our study, and we do not include them in the simulations.

Four new modules, namely MQ, Migration, Remap and Demotion, are added to the memory controller to implement the functionality of page grouping, page migration, page remapping and power state control in RAMZzz, respectively. All the logic of the new modules is performed by the memory controller, and is designed off the critical path of memory accesses, giving priority to the memory accesses from applications. We add a flag bit to indicate whether a request is from applications or from the new modules. The total on-chip storage of the new MC modules in our design is 112KB (as described in the following).

MQ Module. To avoid performance degradation, the MQ module contains a small on-chip cache (64KB with 4K entries) to store the MQ structure and a separate queue (10KB) for the updates to the MQ structure. To find the MQ entry of a physical page, the MC uses hashing with the corresponding page number. Misses in the entry cache produce requests to DRAM. The MQ module's logic snoops the CMD FIFO queue, creating one update per new request. The updates to the MQ structure are performed by the MC off the critical path of memory accesses (via the aforementioned flag bit).


The update queue is implemented as a small circular buffer, where a new update precludes any currently queued update to the same entry. In our design, each physical page descriptor in the MQ queues takes 124 bits. Each descriptor contains the corresponding page number (22 bits), the reference counter (14 bits), the queue number in MQ (4 bits), the last-access time (27 bits), the pointers to other descriptors (54 bits), and the reserved bits for flags (3 bits). The space overhead of our design is low. For the 2GB DRAM, the total space taken by the descriptors is about 8MB (only 0.4% of the total DRAM space).
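To make the 124-bit descriptor concrete, the following sketch packs and unpacks the fields with the widths listed above (the field order here is our own assumption; only the widths are specified in the text):

# Sketch: packing a 124-bit MQ page descriptor.
# Field widths follow the text; the field order is an assumption.
FIELDS = [            # (name, width in bits)
    ("page_number", 22),
    ("ref_counter", 14),
    ("mq_queue", 4),
    ("last_access", 27),
    ("pointers", 54),
    ("flags", 3),
]
assert sum(w for _, w in FIELDS) == 124

def pack(values):
    word = 0
    for name, width in FIELDS:
        v = values[name]
        assert 0 <= v < (1 << width), f"{name} does not fit in {width} bits"
        word = (word << width) | v
    return word

def unpack(word):
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values

if __name__ == "__main__":
    d = {"page_number": 12345, "ref_counter": 7, "mq_queue": 3,
         "last_access": 1_000_000, "pointers": 0, "flags": 1}
    assert unpack(pack(d)) == d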

Migration Module. The Migration module contains the queue of scheduled migrations. The migrations are enqueued in a manner such that concurrent migrations of an Eulerian cycle are put in consecutive positions. At the beginning of each epoch, the OS accesses the current MQ structure to perform grouping and calculate the Eulerian cycle. Then, the OS updates the queue of scheduled migrations (10KB) which is stored in the Migration module. Page migrations start from the beginning of an epoch, and are scheduled once there are idle periods. Priority is given to longer segments because they involve more pages. Memory requests are buffered until the migration is concluded. To facilitate concurrent page migrations according to the Eulerian cycle, each rank is equipped with one extra row-buffer for storing the incoming page. When migrating a page, a rank first writes the outgoing page to the buffer of the target rank, and then reads the incoming page from its own buffer.
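A minimal sketch of one such migration cycle with per-rank buffers (simplified to one page per rank; real migrations move whole segments and are interleaved with idle periods):

# Sketch: exchanging pages along one migration cycle r0 -> r1 -> r2 -> r0
# using one extra buffer per rank. Simplified: one page per rank.
def migrate_cycle(pages, cycle):
    """pages: dict rank -> page it currently holds.
    cycle: list of ranks forming the migration cycle."""
    buffers = {}
    # Step 1: every rank writes its outgoing page into the target's buffer.
    for i, rank in enumerate(cycle):
        target = cycle[(i + 1) % len(cycle)]
        buffers[target] = pages[rank]
    # Step 2: every rank reads its incoming page from its own buffer.
    for rank in cycle:
        pages[rank] = buffers[rank]
    return pages

print(migrate_cycle({0: "A", 1: "B", 2: "C"}, [0, 1, 2]))
# {0: 'C', 1: 'A', 2: 'B'}: each page moved one step along the cycle.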

Remap Module. Similar to the previous design [38], we introduce a new layer of translation between physical addresses assigned by the OS (and stored in the OS page table) and those used by the MC to access DRAM devices. Specifically, the MC maintains the Remapping Table, a hash table for translating physical page addresses coming from the LLC to the actual remapped physical page addresses. The OS can access the Remapping Table as well. After the migration is completed at the beginning of an epoch, the Remapping Table is updated accordingly. Periodically or when the table fills up (at which point the MC interrupts the CPU), the OS commits the new translations to its page table and invalidates the corresponding TLB entries. If the OS uses a hashed inverted page table, e.g., on UltraSparc and PowerPC architectures, the commit operation is considerably simplified. Then, the OS sets a flag in a memory-mapped register in the MC to make sure that the MC refrains from migrating pages during the commit process, and clears the Remapping Table.

When a memory request (with the physical address assigned by the OS) arrives at the MC, it searches the address in the Remapping Table. On a hit, the new physical page address is used by the MC to issue the appropriate commands to retrieve the data from its new location. Otherwise, the original address is used. In terms of access latency, the remapping operation happens when a request is added to the MC queues and does not extend the critical path in the common case because queuing delays at the MC are substantial. For memory-intensive workloads, memory requests usually wait in the MC queues for a long time before being serviced. The above translation can begin when the request is queued and the delay for translation can be easily hidden behind the long waiting time. The notion of introducing a Remapping Table for the MC has been widely used in the past [20], [38], [40].

The Remap module maintains the Remapping Table (28KB with 4K entries) and the logic to remap target addresses. At the end of a migration, the Migration module submits the migration information to the Remap module, which creates new mappings in the Remapping Table. The Remap module snoops the CMD queue to check if it is necessary to remap its entries. We assume each Remapping Table lookup and each remapping take 1 memory cycle. However, these operations only delay a memory request if it finds the CMD queue empty (which is not the common case). Note that the migration and remapping of a segment blocks the accesses to only the pages involved, and concurrent accesses to other pages are still possible.
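Functionally, the lookup is a hash-table probe keyed by the OS-assigned physical page number; a simplified sketch (ignoring the fixed table capacity and the commit/invalidation protocol):

# Sketch: Remapping Table lookup on the request path (simplified).
PAGE_SHIFT = 12                      # 4KB pages

remapping_table = {}                 # OS physical page -> remapped physical page

def remap_address(os_phys_addr):
    page = os_phys_addr >> PAGE_SHIFT
    offset = os_phys_addr & ((1 << PAGE_SHIFT) - 1)
    # On a hit, use the migrated location; otherwise keep the original address.
    new_page = remapping_table.get(page, page)
    return (new_page << PAGE_SHIFT) | offset

remapping_table[0x12345] = 0x54321   # recorded after a page migration
assert remap_address(0x12345678) == 0x54321678
assert remap_address(0x00001000) == 0x00001000   # miss: unchanged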

Demotion Module. The Demotion module performs the demotion to control the power state of each rank according to its demotion configuration. The demotion configuration of each rank is updated by the OS at the beginning of a slot.

OS Modules. Two major new components, Grouping and Prediction Model, are added to the memory management sub-system in the operating system. At the beginning of each epoch, the Grouping module performs grouping and calculates the Eulerian cycle according to the MQ structure, and the OS then updates the queue of scheduled migrations stored in the Migration module. The Prediction Model module runs the prediction model and obtains the demotion configuration for the memory controller at the beginning of each slot.

Note that the structure complexity and storage overhead of RAMZzz are similar to the previous proposals, e.g., [6], [9], [10], [29], [40]. For example, our design has a small DRAM space requirement (less than 2% of the total amount of DRAM).

APPENDIX B
EXPERIMENTAL RESULTS ON MIXED WORKLOADS

Figure 19 presents the experimental results of RAMZzz for twenty additional mixed workloads with the optimization goal of ED2 on DDR3. The memory trace of each mixed workload is collected by executing four randomly chosen SPEC 2006 applications concurrently on PTLSim. RAMZzz has much lower ED2 than BASE on all these workloads, with an average reduction of 46.2% and a range of 30.5–63.9%.


Fig. 19. Comparing ED2 of RAMZzz and ORACLE with the optimization goal of ED2 on DDR3 (mixed workloads Mix1–Mix20).

RAMZzz also achieves a very close ED2 to ORACLE on all twenty workloads.

APPENDIX C
RESULTS ON ENERGY-ORIENTED OPTIMIZATIONS OF SPEC 2006 WORKLOADS

In this section, we present results of SPEC 2006 workloads when RAMZzz's optimization metric is set to energy consumption. We set a relatively high delay budget (10%, 2.5 times that used in the ED2 experiment) to unleash the potential of energy saving. Other system settings of the DRAM system and RAMZzz are the same as those used in Section 5.2 (e.g., 2GB DRAM with 8 ranks). Figure 20 compares the energy consumption of RAMZzz, ORACLE and BASE on different memory architectures. Figure 21 shows the breakdown of time stayed in different power states for RAMZzz on DDR3, DDR2 and LPDDR2. We make the following observations. First, RAMZzz is still more energy-efficient than BASE. The reduction in energy consumption is 66.9%, 65.8% and 65.3% on average on DDR3, DDR2 and LPDDR2, respectively. The reduction is also more significant on the workloads of single applications than the mixed workloads. This is consistent with our observations in the ED2 experiment.

RAMZzz consumes only 5.8%, 4.1% and 3.5% more energy on average than ORACLE on all workloads on DDR3, DDR2 and LPDDR2, respectively. Our study on the power-down timeout shows that our prediction model is very close to ORACLE. RAMZzz thus achieves effectiveness and flexibility under different optimization goals. While the optimization metric is set to energy consumption, the total delay of RAMZzz, including remapping delay, migration delay and resynchronization delay, is less than 7.5% for all three DRAM architectures as well as all workloads. Recall that our delay budget for energy-oriented optimizations is 10%.

We briefly present results of the individual impact of dynamic migrations and adaptive demotions on RAMZzz with the optimization goal of energy consumption. For dynamic page migrations, we observe that RAMZzz has an average reduction of 18.4%, 11.7% and 11.1% over RZ–SP, with a range of 5.6–38.0%, 3.4–29.8% and 2.1–30.0% on DDR3, DDR2 and LPDDR2, respectively. For adaptive demotions, we observe that RAMZzz has a reduction of 29.7–53.8% (39.5% on average), 11.6–51.0% (25.6% on average) and 5.2–43.7% (23.6% on average) over RZ–SD on DDR3, DDR2 and LPDDR2, respectively.

APPENDIX D
STUDIES ON FULL SYSTEM ED2 AND ENERGY CONSUMPTION

In this section, we evaluate the impact of RAMZzz on full-system energy consumption and performance with SPEC 2006 workloads. We start by performing back-of-envelope calculations, following previous studies [29], [46]. We assume that the average power consumption of the DRAM system accounts for 40% of the total system power in the baseline policy (i.e., BASE), and compute a fixed average power estimate (i.e., the remaining 60%) for all other components. Thus, the energy consumption of all other components (excluding DRAM) scales with the program execution time, which is usually consistent with the real-world case [29], [46]. This ratio has been identified as the current contribution of the DRAM system to the entire system power consumption [1], [47], [48]. We also study the impact of varying this ratio in this evaluation. Architectural characteristics and experimental parameters for ED2-oriented and energy-oriented optimizations are the same as those used in Section 5.2 and Appendix C, respectively.
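This back-of-envelope model can be written down directly; the sketch below (with illustrative inputs) keeps the non-DRAM power fixed so that its energy scales with execution time, while the DRAM share scales with the measured memory-energy reduction:

# Sketch: full-system energy/ED2 from DRAM-only results (illustrative numbers).
MEM_RATIO = 0.40            # DRAM share of total system power under BASE

def full_system(mem_energy_norm, delay_norm):
    """mem_energy_norm: DRAM energy relative to BASE.
    delay_norm: execution time relative to BASE (>= 1.0 with delays)."""
    # Non-DRAM components draw a fixed average power, so their energy
    # scales with execution time.
    energy = MEM_RATIO * mem_energy_norm + (1 - MEM_RATIO) * delay_norm
    ed2 = energy * delay_norm ** 2
    return energy, ed2

print(full_system(mem_energy_norm=0.35, delay_norm=1.04))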

Figure 22 presents the full system ED2 of RAMZzz, RZ–SP and RZ–SD (SR FAST is used as the pre-selected low-power state) when the optimization metric is set to ED2 on DDR3. All three approaches still outperform BASE on all workloads in terms of full system ED2. Compared with BASE, the reduction in full system ED2 is 23.0%, 18.0% and 17.8% on average for RAMZzz, RZ–SP and RZ–SD, respectively. RAMZzz outperforms both RZ–SP and RZ–SD in full-system ED2, but leads to slightly higher performance degradations. We observe that RAMZzz has an average reduction of 4.8% (from 1.6% to 17.9%) and 5.3% (from 1.7% to 8.6%) over RZ–SP and RZ–SD in full system ED2, respectively. When the optimization metric is set to energy consumption on DDR3, all three approaches are also more energy-efficient than BASE in full system energy, as shown in Figure 23. RAMZzz outperforms both RZ–SP and RZ–SD, with an average reduction of 4.3% (from 1.4% to 13.8%) and 5.9% (from 1.8% to 9.0%) in full system energy, respectively. We observe similar results on other DRAM architectures.

We further study the ratio of the power consumption of the memory subsystem to the overall power consumption of the full system. Particularly, we vary the ratio from 30% to 50%. Figure 24 shows that the fraction of memory power has a significant effect on both full system ED2 and energy consumption. Increasing the ratio from 30% to 50% (i.e., the power contribution of other components is reduced from 70% to 50%), the normalized full-system ED2 and energy consumption of RAMZzz decrease from 0.84 to 0.70 and from 0.83 to 0.68, respectively.


Fig. 20. Comparing energy consumption of RAMZzz and ORACLE with the optimization goal of energy consumption on three memory architectures: (a) DDR3, (b) DDR2, (c) LPDDR2.

Fig. 21. The breakdown of time stayed in different power states for RAMZzz with the optimization goal of energy consumption on three memory architectures: (a) DDR3, (b) DDR2, (c) LPDDR2.

Fig. 22. Comparing full-system ED2 with the optimization goal of ED2 on DDR3.

Fig. 23. Comparing full-system energy with the optimization goal of energy consumption on DDR3.

Fig. 24. The impact of the ratio of memory power to total system power.

APPENDIX E
SENSITIVITY STUDIES

We use ED2 as the optimization metric, DDR3 as the target memory architecture and SPEC 2006 workloads to conduct the sensitivity analysis. Since RAMZzz is very close to ORACLE, we present results for RAMZzz, RZ–SP and RZ–SD (choosing PRE PDN SLOW as the target low-power state) only. In those studies, we vary one parameter at a time and keep other parameters at their default settings. Due to the space limitation, we present the figures for M4 (a modest case among all those workloads) and comment on other workloads without figures when appropriate.

DRAM parameters. We study the impact of different numbers of ranks and memory capacities of DRAM. As the number of ranks increases, we observe a rather stable ED2 for RAMZzz, RZ–SD and RZ–SP. For all three approaches, when the number of ranks increases from 2 to 4, ED2 drops by less than 1%, because of a finer grained power control on ranks. When the number of ranks increases from 4 and 8 to 16, ED2 increases by less than 3%. The major reason for the increasing ED2 is the increased amount of page migrations caused by the increasing number of ranks. Figure 25 shows the results for varying the memory capacity. As memory capacities increase, all three methods achieve a lower ED2, and the ED2 improvements of RAMZzz over RZ–SD and RZ–SP both become larger. That indicates the effectiveness of our approach on larger-memory systems.

RAMZzz parameters. We study the impact of different epoch/slot sizes and delay budgets of RAMZzz.

Figure 26 shows the results of varying the epoch size. RZ–SP is not sensitive to the epoch size, whereas the ED2 of RAMZzz and RZ–SD both increase slightly. That is because, for a longer epoch, the rank hotness cannot reflect the changes in page access locality in time, and the ED2 improvement brought by page migrations is slightly reduced.


Fig. 25. Comparing ED2 with varying DRAM capacity on M4 on DDR3.

Fig. 26. Comparing ED2 with varying epoch size on M4 on DDR3.

Fig. 27. Comparing ED2 with varying delay budget on M4 on DDR3.

We observed a similar result when varying the slot size in (0.125 × 10^8 × 2^i) cycles (i = 0, 1, ..., 3). The ED2 of RAMZzz varies by less than 2%. In practice, we set the slot size to 10^8 cycles and the epoch size to 10^9 cycles as a compromise between the prediction overhead and the accuracy.

Figure 27 compares ED2 for varying delay budgets. A small delay budget limits the potential for energy saving, whereas a large delay budget leads to overly aggressive energy saving and exaggerates the delay incurred by mispredictions. In practice, we set the delay budget within 1–4% for optimizing ED2.

APPENDIX F
STUDIES ON MQ STRUCTURE

Figure 28 shows the physical page access frequencies for pages stored in different levels of the MQ structure during an epoch after page migrations of RAMZzz on the DDR3 architecture and the M4 workload. We observe that pages stored in high MQ levels have higher access frequencies, while pages stored in low MQ levels have lower access frequencies. This shows that the MQ structure actually correlates with the page access distribution. We have consistent observations on other workloads and memory architectures.

APPENDIX G
STUDIES ON PREDICTION OF IDLE PERIOD DISTRIBUTION

We compare the predicted idle histogram to the actual idle histogram of RAMZzz on Rank 0 on the DDR3 architecture for a selected workload (M4) in this section. The predicted histogram is close to the actual histogram in our evaluation in both cases: 1) the slot is not the beginning of an epoch, as shown in Figure 29 (in this case, the actual histogram in the previous slot is used as the prediction of the current slot); 2) the slot is the beginning of an epoch, as shown in Figure 30 (in this case, the algorithm developed in Section 4.2 is used to estimate the predicted histogram). We have observed similar results on other workloads and memory architectures.

We also find that the confidence of the regression curve being an exponential function (i.e., R^2) on the actual histogram of idle periods is high (0.94 and 0.92 in Figures 29 and 30, respectively). Thus, our assumption that the inter-arrival times of memory requests follow a Poisson distribution is valid.
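Such an R^2 value can be reproduced by a log-linear least-squares fit of the histogram (our own sketch below; the exact fitting tool used is not specified in the text):

# Sketch: goodness of an exponential fit (R^2) to an idle-period histogram
# via least squares on log(frequency) = a + b * length.
import math

def exponential_r2(lengths, freqs):
    xs = lengths
    ys = [math.log(f) for f in freqs]          # requires positive frequencies
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Frequencies that decay roughly exponentially fit well (R^2 close to 1).
print(exponential_r2([1000, 2000, 4000, 8000, 16000],
                     [20000, 11000, 5500, 1600, 400]))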

REFERENCES

[1] U. Hoelzle and L. A. Barroso, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, 1st ed. Morgan and Claypool Publishers, 2009.
[2] C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and T. W. Keller, "Energy management for commercial servers," IEEE Computer, vol. 36, no. 12, 2003.
[3] D. Meisner, B. T. Gold, and T. F. Wenisch, "Powernap: eliminating server idle power," in ASPLOS '09, 2009.
[4] M. S. Ware, K. Rajamani, M. S. Floyd, B. Brock, J. C. Rubio, F. L. R. III, and J. B. Carter, "Architecting for power management: The ibm power7 approach," in HPCA '10, 2010.
[5] H. Huang, P. Pillai, and K. G. Shin, "Design and implementation of power-aware virtual memory," in USENIX ATC '03, 2003.
[6] H. Huang, K. G. Shin, C. Lefurgy, and T. Keller, "Improving energy efficiency by making dram less randomly accessed," in ISLPED '05, 2005.
[7] V. Delaluz, A. Sivasubramaniam, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin, "Scheduler-based dram energy management," in DAC '02, 2002.
[8] C. Bae and T. Jamel, "Energy-aware memory management through database buffer management," in WEED '11, 2011.
[9] X. Fan, C. Ellis, and A. Lebeck, "Memory controller policies for dram power management," in ISLPED '01, 2001.
[10] B. Diniz, D. Guedes, W. Meira, Jr., and R. Bianchini, "Limiting the power consumption of main memory," in ISCA '07, 2007.
[11] C. Bienia, "Benchmarking modern multiprocessors," Ph.D. dissertation, Princeton University, January 2011.
[12] A. R. Lebeck, X. Fan, H. Zeng, and C. Ellis, "Power aware page allocation," in ASPLOS '00, 2000.
[13] H. Zheng and Z. Zhu, "Power and performance trade-offs in contemporary dram system designs for multicore processors," IEEE Trans. on Comput., vol. 59, no. 8, 2010.
[14] K. T. Malladi, B. C. Lee, F. A. Nothaft, C. Kozyrakis, K. Periyathambi, and M. Horowitz, "Towards energy-proportional datacenter memory with mobile dram," in ISCA '12, 2012.
[15] Micron Tech., Inc., MT41J256M4JP-15E Datasheet, 2010.
[16] Micron Tech., Inc., MT47H128M8CF-25E Datasheet, 2007.
[17] Micron Tech., Inc., MT42L128M32D1LF-25WT Datasheet, 2011.
[18] Micron Tech., Inc., System Power Calculator, http://www.micron.com/products/support/power-calc, 2012.
[19] I. Hur and C. Lin, "A comprehensive approach to dram power management," in HPCA '08, 2008.
[20] K. Sudan, K. Rajamani, W. Huang, and J. Carter, "Tiered memory: An iso-power memory architecture to address the memory power wall," IEEE Trans. on Comput., vol. 61, no. 12, 2012.
[21] V. Delaluz, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin, "Energy-oriented compiler optimizations for partitioned memory architectures," in CASES '00, 2000.
[22] V. Delaluz, M. Kandemir, N. Vijaykrishnan, A. Sivasubramaniam, and M. J. Irwin, "Dram energy management using software and hardware directed power mode control," in HPCA '01, 2001.


Fig. 28. The average page access frequencies for different MQ levels.

Fig. 29. Comparing the actual and predicted idle histogram of a slot that is not the start of an epoch.

Fig. 30. Comparing the actual and predicted idle histogram of a slot that is the start of an epoch.

[23] H. Zheng, J. Lin, Z. Zhang, E. Gorbatov, H. David, and Z. Zhu, "Mini-rank: Adaptive dram architecture for improving memory power efficiency," in MICRO '41, 2008.
[24] J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber, "Future scaling of processor-memory interfaces," in SC '09, 2009.
[25] H. Zheng, J. Lin, Z. Zhang, and Z. Zhu, "Decoupled dimm: Building high-bandwidth memory system using low-speed dram devices," in ISCA '09, 2009.
[26] M. Bi, R. Duan, and C. Gniady, "Delay-hiding energy management mechanisms for dram," in HPCA '10, 2010.
[27] E. Cooper-Balis and B. Jacob, "Fine-grained activation for power reduction in dram," IEEE Micro, vol. 30, 2010.
[28] H. David, C. Fallin, E. Gorbatov, U. R. Hanebutte, and O. Mutlu, "Memory power management via dynamic voltage/frequency scaling," in ICAC '11, 2011.
[29] Q. Deng, D. Meisner, L. Ramos, T. F. Wenisch, and R. Bianchini, "Memscale: active low-power modes for main memory," in ASPLOS '11, 2011.
[30] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter, "Atlas: A scalable and high-performance scheduling algorithm for multiple memory controllers," in HPCA '10, 2010.
[31] A. N. Udipi, N. Muralimanohar, N. Chatterjee, R. Balasubramonian, A. Davis, and N. P. Jouppi, "Rethinking dram design and organization for energy-constrained multi-cores," in ISCA '10, 2010.
[32] B. He, Q. Luo, and B. Choi, "Cache-conscious automata for xml filtering," IEEE Trans. on Knowl. and Data Eng., vol. 18, no. 12, 2006.
[33] B. He and Q. Luo, "Cache-oblivious databases: Limitations and opportunities," ACM Trans. Database Syst., vol. 33, no. 2, 2008.
[34] B. He and Q. Luo, "Cache-oblivious query processing," in CIDR '07, 2007.
[35] K. Kumar, K. Doshi, M. Dimitrov, and Y.-H. Lu, "Memory energy management for an enterprise decision support system," in ISLPED '11, 2011.
[36] D. Wu, B. He, X. Tang, J. Xu, and M. Guo, "Ramzzz: rank-aware dram power management with dynamic migrations and demotions," in SC '12, 2012.
[37] Y. Zhou, J. Philbin, and K. Li, "The multi-queue replacement algorithm for second level buffer caches," in USENIX ATC '01, 2001.
[38] L. E. Ramos, E. Gorbatov, and R. Bianchini, "Page placement in hybrid memory systems," in ICS '11, 2011.
[39] Xilinx, Inc., Spartan-6 FPGA Memory Controller User Guide, 2010.
[40] K. Sudan, N. Chatterjee, D. Nellans, M. Awasthi, R. Balasubramonian, and A. Davis, "Micro-pages: increasing dram efficiency with locality-aware data placement," in ASPLOS '10, 2010.
[41] M. T. Yourst, "Ptlsim: A cycle accurate full system x86-64 microarchitectural simulator," in ISPASS '07, 2007.
[42] T. E. Carlson, W. Heirman, and L. Eeckhout, "Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulations," in SC '11, 2011.
[43] X. Guo, E. Ipek, and T. Soyata, "Resistive computation: Avoiding the power wall with low-leakage, stt-mram based computing," in ISCA '10, 2010.
[44] Z. Zhang, Z. Zhu, and X. Zhang, "A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality," in MICRO '00, 2000.
[45] B. Wang, B. Wu, D. Li, X. Shen, W. Yu, Y. Jiao, and J. S. Vetter, "Exploring hybrid memory for gpu energy efficiency through software-hardware co-design," in PACT '13, 2013.
[46] N. Chatterjee, M. Shevgoor, R. Balasubramonian, A. Davis, Z. Fang, R. Illikkal, and R. Iyer, "Leveraging heterogeneity in dram main memories to accelerate critical word access," in MICRO '45, 2012.
[47] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazieres, S. Mitra, A. Narayanan, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman, "The case for ramclouds: scalable high-performance storage entirely in dram," SIGOPS Oper. Syst. Rev., vol. 43, no. 4, 2010.
[48] D. Tsirogiannis, S. Harizopoulos, and M. A. Shah, "Analyzing the energy efficiency of a database server," in SIGMOD '10, 2010.