Tiered Memory: An Iso-Power Memory Architecture to Address the Memory Power Wall

Kshitij Sudan, Student Member, IEEE, Karthick Rajamani, Senior Member, IEEE, Wei Huang, Member, IEEE, and John B. Carter, Senior Member, IEEE

Abstract—Moore’s Law improvement in transistor density is driving a rapid increase in the number of cores per processor. DRAM device capacity and energy efficiency are increasing at a slower pace, so the importance of DRAM power is increasing. This problem presents system designers with two nominal options when designing future systems: 1) decrease off-chip memory capacity and bandwidth per core or 2) increase the fraction of system power allocated to main memory. Reducing capacity and bandwidth leads to imbalanced systems with poor processor utilization for noncache-resident applications, so designers have chosen to increase the DRAM power budget. This choice has been viable to date, but is fast running into a memory power wall. To address the looming memory power wall problem, we propose a novel iso-power tiered memory architecture that supports 2-3X more memory capacity for the same power budget as traditional designs by aggressively exploiting low-power DRAM modes. We employ two “tiers” of DRAM, a “hot” tier with active DRAM and a “cold” tier in which DRAM is placed in self-refresh mode. The DRAM capacity of each tier is adjusted dynamically based on aggregate workload requirements and the most frequently accessed data are migrated to the “hot” tier. This design allows larger memory capacities at a fixed power budget while mitigating the performance impact of using low-power DRAM modes. We target our solution at server consolidation scenarios where physical memory capacity is typically the primary factor limiting the number of virtual machines a server can support. Using iso-power tiered memory, we can run 3X as many virtual machines, achieving a 250 percent improvement in average aggregate performance, compared to a conventional memory design with the same power budget.

Index Terms—Memory power management, memory power wall, DRAM memory systems, DRAM low power modes, virtual machine consolidation, DRAM data allocation.


1 INTRODUCTION

To avoid performance bottlenecks, computer architects strive to build systems with balanced compute, memory, and storage capabilities. Until recently, the density and energy efficiency of processing, memory, and storage components tended to improve at roughly the same rate from one technology generation to the next, but this is no longer the case due primarily to power and thermal considerations [2].

Main memory capacity and bandwidth are fast becoming dominant factors in server performance. Due largely to power and thermal concerns, the prevailing trend in processor architecture is to increase core counts rather than processor frequency. Multicore processors place tremendous pressure on memory system designers to increase main memory capacity and bandwidth at a rate proportional to the increase in core counts.

The pressure to grow memory capacity and bandwidth is particularly acute for mid to high-end servers, which often are used to run memory-intensive database and analytics applications or to consolidate large numbers of memory-hungry virtual machine (VM) instances. Server virtualization is increasingly popular because it can greatly improve server utilization by sharing physical server resources between VM instances. In practice, virtualization shares compute resources effectively, but main memory is difficult to share effectively. As a result, the amount of main memory needed per processor is growing at or above the rate of increase in core counts. Unfortunately, DRAM device density is improving at a slower rate than per-processor core counts, and per-bit DRAM power is improving even more slowly. The net effect of these trends is that servers need to include more DRAM devices and allocate an increasing proportion of the system power budget to memory to keep pace with the server usage trends and processor core growth.

1.1 Memory Power Wall

Since DRAM energy efficiency improvements have not kept pace with the increasing demand for memory capacity, servers have had to significantly increase their memory subsystem power budgets [7]. For example, Ware et al. [28] report that in the shift from IBM POWER6-processor-based servers to POWER7-based servers, the processor power budget for a representative high-end server shrank from 53 to 41 percent of total system power. At the same time, the memory power budget increased from 28 to 46 percent.


K. Sudan is with the University of Utah, 50 S. Central Campus Drive, Room 3190, Salt Lake City, UT 84112. E-mail: [email protected].

K. Rajamani, W. Huang, and J.B. Carter are with IBM Austin Research Laboratory, 11501 Burnet Road, Austin, TX 78758. E-mail: {karthick, huangwe, retrac}@us.ibm.com.

Manuscript received 5 Nov. 2011; revised 22 Mar. 2012; accepted 22 Apr. 2012; published online 30 May 2012. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCSI-2011-11-0869. Digital Object Identifier no. 10.1109/TC.2012.119.


Most servers today are forced to restrict the speed of their highest capacity memory modules to accommodate them within the server’s power budget [13]. This trend is not sustainable—overall system power budgets have remained constant or decreased and many data centers already operate close to facility power and/or server line-cord power limits. These power constraints are now the primary factor limiting memory capacity and performance, which in turn limits server performance. We call this emerging problem the memory power wall. To overcome this memory power wall, system designers must develop methods to increase memory capacity and bandwidth for a fixed power budget.

We propose to attack the problem by exploiting DRAM low-power idle modes. DDR3 DRAM supports active power-down, fast precharge power-down, slow precharge power-down, and self-refresh [1] modes. Each mode progressively reduces DRAM device power by gating additional components within the device, but increases the latency to return to the active state and start servicing requests. This introduces a tradeoff between idle mode power reduction and potential increase in access latency.

DRAM low-power modes are not aggressively exploited due to two factors: 1) the coarse granularity at which memory can be put in a low-power mode, and 2) application memory access patterns. Most DRAM-based memory systems follow the JEDEC standard [20]. Memory is organized as modules (DIMMs) composed of multiple DRAM devices. When a processor requests a cache line of data from main memory, the physical address is used to select a channel, then a DIMM, and finally a rank within the DIMM to service the request. All devices in a rank work in unison to service each request, so the smallest granularity at which memory can be put into a low-power mode is a rank. For typical systems built using 4 GB DIMMs, a rank is 2 GB in size, a substantial fraction of main memory even for a large server.

The large granularity at which memory can be placed into low-power modes is a serious problem because of the increasingly random distribution of memory accesses across ranks. To increase memory bandwidth, most servers interleave consecutive cache lines in the physical address space across ranks. Further, without a coordinated effort by memory managers at all levels in the system (library and operating system), application data are effectively randomly allocated across DRAM ranks. Virtualization only exacerbates this problem [19] because it involves yet another memory management entity, the hypervisor. Consequently, frequently accessed data tend to be spread across ranks, so no rank experiences sufficient idleness to warrant being placed in a low-power mode. As a result, few commercial systems exploit DRAM low-power modes and those that do typically only use the deep low-power modes on nearly idle servers.

1.2 An Iso-Power Tiered Memory Design

To address the memory power wall problem, we propose an iso-power tiered memory architecture that exploits DRAM low-power modes by creating two tiers of DRAM (hot and cold) with different power and performance characteristics. The tiered memory configuration has the same total memory power budget (iso-power) but allows for larger total capacity. The hot tier is comprised of DRAM in the active or precharge standby (full power) mode when idle, while the cold tier is DRAM in self-refresh (low power) mode when idle. The cold tier retains memory contents and can service memory requests, but memory references to ranks in the cold tier will experience higher access latencies. To optimize performance, we dynamically migrate data between tiers based on their access frequency. We can also dynamically adjust the amount of DRAM in the hot and cold tiers based on aggregate workload memory requirements.

If we are able to keep most of main memory in a low-power mode, we can build servers with far more memory than traditional designs for the same power budget. An idle rank of DRAM in self-refresh mode consumes roughly one-sixth of the power of an idle rank in active-standby mode, so an iso-power tiered memory design can support up to six ranks in self-refresh mode or one rank in active-standby mode for the same power budget. The actual ratio of cold ranks per hot rank is dependent on the distribution of memory requests to the hot and cold tiers, and is derived in Section 3.1.1. For example, the same power budget can support eight hot ranks and no cold ranks (16 GB), four hot ranks and 12 cold ranks (32 GB), or two hot ranks and 22 cold ranks (48 GB). Which organization is optimal is workload dependent. For workloads with small aggregate working sets and/or high sensitivity to memory access latency, the baseline configuration (16 GB of DRAM in the active mode) may perform best. For workloads with large aggregate working sets, an organization that maximizes total memory capacity (e.g., 4 GB in active mode and 44 GB in self-refresh mode) may perform best. VM consolidation workloads typically are more sensitive to memory capacity than memory latency, and thus are likely to benefit greatly from the larger capacity tiered configurations. Our iso-power tiered memory architecture exploits this ability to trade off memory access latency for memory capacity under a fixed power budget.
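
The idle-power arithmetic behind these example configurations can be illustrated with a short calculation. The sketch below is a simplification that counts only idle rank power at the 6:1 hot:cold ratio quoted above; the full model in Section 3.1.1 also accounts for active and transition power, so the exact configurations differ. The constants and rank size are illustrative assumptions.

```python
# Simplified idle-power check for the example configurations, assuming a
# 6:1 hot:cold idle-power ratio and 2 GB ranks (both quoted in the text).
# Active and transition power, handled by the model in Section 3.1.1, are
# ignored here.
HOT_UNITS = 1.0        # idle power of one active-standby rank (arbitrary units)
COLD_UNITS = 1.0 / 6   # idle power of one self-refresh rank
RANK_GB = 2            # capacity of one rank

budget = 8 * HOT_UNITS  # baseline: eight hot ranks, no cold ranks

for hot, cold in [(8, 0), (4, 12), (2, 22)]:
    idle = hot * HOT_UNITS + cold * COLD_UNITS
    print(f"{hot} hot + {cold} cold ranks: {(hot + cold) * RANK_GB} GB, "
          f"idle power {idle:.2f} of {budget:.0f} units")
```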

A key enabler of our approach is that for many applications, a small fraction of the application’s memory footprint accounts for most of its main memory accesses. Thus, if hot data can be identified and migrated to the hot tier, and cold data to the cold tier, we can achieve the power benefits of tiering without negatively impacting performance.

To ensure hot and cold data are placed in the appropriate tier, we modify the operating system or the hypervisor to track access frequency. We divide execution into epochs of 20M-100M instructions and track how often each DRAM page is accessed during each epoch. At the end of each epoch, system software migrates data to the appropriate tier based on its access frequency.

To limit the latency and energy overhead of entering and exiting low-power modes, we do not allow memory requests to wake up cold ranks arbitrarily. Instead, we delay servicing requests to cold ranks when they are in self-refresh mode, which allows multiple requests to a cold rank to be queued and handled in a burst when it is made (temporarily) active. To avoid starvation, we set an upper bound on how long a request can be delayed. The performance impact of queuing requests depends on both the arrival rate of requests to ranks in self-refresh mode and the latency sensitivity of the particular workload. Our primary focus is on mid to high-end servers used as virtualization platforms. In this environment, memory capacity is generally more important than memory latency, since DRAM capacity tends to be the primary determinant of how many VMs can be run on a single physical server.


Our experiments, discussed in Section 4, indicate that occasionally queuing memory requests decreases individual application performance by less than 5 percent, while being able to support 3X as much total DRAM, increasing aggregate server throughput by 2.2-2.9X under a fixed power budget.

The main contributions of this paper are the following:

- We propose a simple, adaptive two-tier DRAM organization that can increase memory capacity on demand within a fixed memory power budget.
- We demonstrate how variation in page access frequency can be exploited by the tiered memory architecture.
- We analyze the proposed iso-power tiered memory architecture to quantify its impact on server memory capacity and performance.

2 MEMORY ACCESS CHARACTERISTICS OF WORKLOADS

Our iso-power tiered memory architecture identifies hot/cold pages in an application’s address space, migrates them into the appropriate tiers, and exploits these differentiated access characteristics to increase total memory capacity. This is only feasible if applications exhibit strong access locality, wherein a relatively small fraction of memory is responsible for a large fraction of dynamic accesses. To determine if that is so, we first examined a set of applications from the PARSEC suite using their native inputs. We instrumented the OS to track references to memory blocks every 10 msecs at a 128 MB granularity—the first access to a block in a 10 ms interval would increment the reference count for that block. Fig. 1a presents a cumulative distribution function (CDF) of the block reference counts showing the fraction of workload footprint needed to cover a given fraction of block reference counts for each workload. While these data are collected at a fairly coarse spatial (128 MB) and temporal granularity (10 msec), it is evident that there is substantial memory access locality. In all cases, 10-20 percent of each application’s memory accounts for nearly all of its accesses, which is sufficient locality for iso-powered tiered memory to be effective.
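
For reference, coverage curves like those in Fig. 1 can be derived from the per-block reference counts with a few lines of code. The sketch below is illustrative only; the function name and the toy data are assumptions, not part of our measurement infrastructure.

```python
# Illustrative computation of a coverage CDF from per-block reference counts
# (collected every 10 ms at 128 MB granularity in our measurements): what
# fraction of the footprint is needed to cover a given fraction of references?
def coverage_cdf(ref_counts):
    counts = sorted(ref_counts, reverse=True)   # hottest blocks first
    total = sum(counts) or 1
    covered, points = 0, []
    for i, c in enumerate(counts, start=1):
        covered += c
        points.append((i / len(counts), covered / total))
    return points

# Toy example: 10 of 100 blocks receive almost all references.
for frac_blocks, frac_refs in coverage_cdf([1000] * 10 + [1] * 90):
    if frac_refs >= 0.95:
        print(f"{frac_blocks:.0%} of footprint covers {frac_refs:.0%} of references")
        break
```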

To determine whether tracking reference frequency at a small granularity would demonstrate similar locality, we re-ran our experiments using an event-driven system simulator (described in detail in Section 4.2) and tracked accesses at a 4 KB OS page granularity. Fig. 1b plots the resulting memory access CDFs for seven applications. The finer granularity of simulator-measured reference locality data shows a similar spatial locality in the memory region references as we found with the coarse-grained measurements on existing hardware.

Fig. 2 shows the distribution of accesses to individual 4 KB pages sorted from the most frequently accessed page to the least, for two of the most memory-intensive PARSEC and SPEC CPU2006 benchmarks, canneal and libquantum. Both applications exhibit traditional “stair step” locality, with sharp knees in their access CDF graphs that indicate the size of individual components of each application’s working set. The cumulative working set sizes where steep transitions occur are good candidate hot tier sizes if our hot/cold data migration mechanism can effectively migrate hot/cold data to the appropriate tier.

3 IMPLEMENTATION

Tiered memory is a generalizable memory architecture that leverages the heterogeneous power-performance characteristics of each tier. For instance, there is a large body of recent work on hybrid memory systems that mix multiple memory technologies (e.g., DRAM and PCM) [4], [8]. Our work introduces the notion of DRAM-only tiered designs, which can be implemented with existing memory technologies. We use the self-refresh mode, which provides the maximum reduction in idle DRAM power, but also the highest exit latency, to create our cold tier of memory. While we could in theory support more than two tiers, the incremental benefit of using higher power idle modes is marginal and increases implementation complexity.

3.1 Implementing Tiered Memory in a DRAM Memory System

We implement iso-powered tiered memory by dividing memory into two tiers, a hot tier in the active state and a cold tier in self-refresh mode.


Fig. 1. Cumulative access distribution of DRAM memory requests for different workloads.

To balance the goal of maximizing power savings while mitigating the performance impact, we can adjust the size of the hot and cold tiers based on the size and locality of the working sets of running applications. For some application mixes, optimal iso-power performance is achieved by placing as many ranks as possible given the power budget in the hot tier, and turning off the remainder of DRAM. Referring to our example from Section 1, this would be the organization with eight ranks (16 GB) in the hot tier and no ranks in the cold tier. Other application mixes might have large footprints but very high spatial locality, and thus achieve the best performance by prioritizing capacity, e.g., the two hot rank (4 GB) and 22 cold rank (44 GB) example from Section 1.

In our proposed design, servers are configured with enough DIMMs to store the maximum-sized iso-power configuration supported, which in the example above would be 48 GB. For the purposes of this paper, we assume that these ranks (DIMMs) are allocated evenly across available memory channels. The operating system (or hypervisor) determines how many ranks should be in each power mode given the memory behavior of the current application mix, subject to a total memory system power budget. Since we only consider one application mix at a time, we do not dynamically resize the tiers midrun, but rather evaluate the performance of several iso-power hot/cold tier configurations. In Section 5, we discuss other possible organizations, e.g., modifying the number of memory channels or memory controllers, but they do not have a substantial impact on our conclusions.

3.1.1 Iso-Power Tiered Memory Configurations

As a first step, we need to determine what mix of hot and cold ranks can be supported for a given power budget. Assume a baseline (all hot ranks) design with N ranks. An iso-power tiered configuration can support a mix of Nhot hot ranks and Ncold cold ranks (Nhot ≤ N ≤ Nhot + Ncold). Whenever we adjust the number of active ranks, we do so evenly across all memory channels and memory controllers, so the number of hot/cold/off ranks per memory channel increases/decreases in direct proportion to changes in tier size and aggregate capacity. This design imposes the fewest changes to current memory controller designs, but other designs are discussed in Section 5.

Note that the number of hot/cold ranks is not the only factor that impacts power consumption—power is also consumed by actual memory operations. To avoid exceeding the power budget, we employ well-known memory throttling techniques [13] to limit the rate of accesses to the memory system. We can adjust the active power budget up/down by decreasing/increasing the memory throttle rate. The power allocated for memory accesses impacts the number of hot and cold ranks that can be supported for a given power budget. Note that no actual throttling was needed during our evaluations as the aggregate bandwidth demand for the memory tiers was low enough to be met within the memory system power budget.

We compute iso-power tiered memory configurations based on the desired peak memory request rate (rpeak), average service time (tsvc), the idle power of hot (phot) and cold (pcold) ranks of DRAM, the fraction of requests expected to be serviced by the hot tier (α), the amount of time we are willing to queue requests destined for the cold tier (tq), and the latency of entering and exiting the low-power mode of the cold tier (te). Using psvc for the average power to service a request, we can represent the power equivalence as follows:

Tiered memory power = Conventional memory power:

$$p_{hot}N_{hot} + (p_{svc} - p_{hot})\alpha r_{peak}t_{svc} + p_{cold}N_{cold} + (p_{svc} - p_{cold})(1-\alpha)r_{peak}t_{svc} + (p_{hot} - p_{cold})N_{cold}\frac{t_e}{t_q}\left(1 - \frac{(1-\alpha)r_{peak}t_{svc}}{N_{cold}}\right) = p_{hot}N + p_{svc}r_{peak}t_{svc}. \tag{1}$$

After canceling out the active power components (terms with psvc), the baseline idle power can be redistributed among a greater number of ranks by placing idle cold ranks in low-power mode:

$$p_{hot}N_{hot} - p_{hot}\alpha r_{peak}t_{svc} + p_{cold}N_{cold} - p_{cold}(1-\alpha)r_{peak}t_{svc} + (p_{hot} - p_{cold})N_{cold}\frac{t_e}{t_q}\left(1 - \frac{(1-\alpha)r_{peak}t_{svc}}{N_{cold}}\right) = p_{hot}N. \tag{2}$$


Fig. 2. Page access distribution at 4 KB granularity measured on our simulator.

If we define Pratio to be the ratio of the idle power of hot and cold ranks (phot/pcold) and Tratio to be the ratio of the transition latency to the maximum queuing delay (te/tq), (2) can be reduced to an expression for the number of cold ranks (Ncold) that can be supported for a given number of hot ranks (Nhot):

$$N_{cold} = \frac{P_{ratio}(N - N_{hot}) - (P_{ratio} - 1)(1-\alpha)r_{peak}t_{svc}(1 - T_{ratio})}{1 + (P_{ratio} - 1)T_{ratio}}. \tag{3}$$

Pratio and rpeak are constants and tsvc values are in a fairly narrow range for any given system design. α is application dependent. Given their values, we can solve for Ncold as a function of Nhot for a particular point in the design space. Fig. 3 depicts the resulting set of possible memory configurations that can be supported for a particular access pattern (α) for various DRAM technologies (Pratio). For this analysis, we assume a baseline design with eight ranks (Nhot = 8, Ncold = 0) and a DRAM service time of one quarter of the DRAM activate-precharge latency. The range of hot-to-cold idle power ratios shown along the x-axis (6-10) is typical for modern DRAM devices [1]. As can be seen in Fig. 3, we can support larger tiered memory sizes by using deeper low-power modes (increasing Pratio), and when more accesses hit the hot tier (increasing α).
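
The sketch below is a direct transcription of (3). The parameter values passed to it are placeholders for illustration (rpeak, tsvc, and Tratio are system specific), so the printed numbers are not the configurations evaluated in this paper.

```python
# Direct transcription of (3): number of cold ranks supportable alongside
# n_hot hot ranks under the baseline power budget. Parameter values below are
# illustrative placeholders, not the values used in our evaluation.
def n_cold(n_total, n_hot, p_ratio, t_ratio, alpha, r_peak, t_svc):
    numer = (p_ratio * (n_total - n_hot)
             - (p_ratio - 1) * (1 - alpha) * r_peak * t_svc * (1 - t_ratio))
    denom = 1 + (p_ratio - 1) * t_ratio
    return max(0.0, numer / denom)

for hot in (8, 4, 2):
    cold = n_cold(n_total=8, n_hot=hot, p_ratio=6, t_ratio=0.05,
                  alpha=0.7, r_peak=0.5, t_svc=0.25)
    print(f"Nhot = {hot}: up to {cold:.1f} cold ranks")
```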

We use this model to identify iso-power tiered memory configurations of interest. For our analysis, we consider two configurations: 1) one with four hot ranks and 12 cold ranks ((Nhot, Ncold) = (4, 12)), and 2) one with two hot ranks and 22 cold ranks ((Nhot, Ncold) = (2, 22)). Both of these configurations fit in the same power budget even when only 70 percent of memory accesses hit the hot rank (α = 0.7) and when the ratio of hot-to-cold idle power is only six. Note that throttling ensures that even applications with particularly poor locality (α under 0.7) operate under the fixed power budget, although they will suffer greater performance degradation.

3.1.2 Servicing Requests from Tiered Memory

Memory requests to hot tier ranks are serviced just as they would be in a conventional system. For ranks in the cold tier, requests may be queued briefly to maximize the time spent in DRAM low-power mode and amortize the cost of changing power modes. Fig. 4 shows how requests to the cold tier are handled. When there are no pending requests destined for a particular cold rank, it is immediately put in self-refresh mode. If a request arrives at a rank in self-refresh mode, e.g., at time t0 in Fig. 4, it is queued for up to tq seconds. Any other requests that arrive during this interval are also queued. After experimenting with different values of tq, we use a fixed value of tq = 10 × texit + tentry for our experiments. We found this value to be long enough to amortize entry/exit overhead, but also small enough to not trigger software error recovery mechanisms.

To support this mechanism, we augment the memory controller DRAM scheduler with per-rank timers. When a request arrives at a rank in self-refresh mode, the timer is started. When it reaches tq − texit, the controller issues the DRAM commands required to begin the transition of the rank from self-refresh to active mode. The rank remains in the active mode as long as there are requests pending for it. Once all the pending requests are serviced, the rank is transitioned back into low-power mode.
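
A behavioral sketch of this per-rank policy is shown below, assuming a cycle-driven model in which time advances in DRAM cycles. The class and its structure are illustrative assumptions, not the memory controller implementation itself; entry latency and command scheduling details are abstracted away.

```python
from collections import deque

# Behavioral sketch of cold-rank request handling: requests to a rank in
# self-refresh are queued for up to t_q cycles; the wakeup is issued t_exit
# cycles early so the rank is active when the queuing bound expires.
class ColdRank:
    def __init__(self, t_q, t_exit):
        self.t_q, self.t_exit = t_q, t_exit
        self.queue = deque()
        self.state = "self_refresh"
        self.timer_start = None            # cycle of the oldest queued request

    def enqueue(self, now, request):
        if self.state == "self_refresh" and not self.queue:
            self.timer_start = now         # start the per-rank timer
        self.queue.append(request)

    def tick(self, now):
        if (self.state == "self_refresh" and self.queue
                and now - self.timer_start >= self.t_q - self.t_exit):
            self.state = "waking"          # issue self-refresh exit
        elif self.state == "waking" and now - self.timer_start >= self.t_q:
            self.state = "active"
        elif self.state == "active":
            if self.queue:
                self.queue.popleft()       # service one request (timing abstracted)
            else:
                self.state = "self_refresh"  # drained: re-enter low-power mode
```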

3.1.3 Identifying Hot and Cold Pages

Our tiered memory mechanism is most effective when most accesses are made to the hot tier, so we track page access frequency and periodically migrate pages between the hot and cold tiers based on their access frequency. To identify hot/cold pages, we assume that the memory controller is augmented to support a version of the access frequency tracking mechanism proposed by Ramos et al. [26], which builds on the Multiqueue (MQ) algorithm [30] first proposed for ranking disk blocks.

The algorithm employs M LRU queues of page descriptors (q0 ... qM−1), each of which includes a page number, reference count, and an expiration value. The descriptors in qM−1 represent the pages that are most frequently accessed. The algorithm also maintains a running count of total DRAM accesses (Current).


Fig. 3. Iso-Power tiered memory capacity relative to an 8-rank conventional system.

Fig. 4. Cold tier rank mode changes with request arrival.

When a page is first accessed, the corresponding descriptor is placed at the tail of q0 and its expiration value is initialized to Current + Timeout, where Timeout is the number of consecutive accesses to other pages before we expire a page’s descriptor. Every time the page is accessed, its reference count is incremented, its expiration time is reset to Current + Timeout, and its descriptor is moved to the tail of its current queue. The descriptor of a frequently used page is promoted from qi to qi+1 (saturating at qM−1) if its reference count reaches 2^(i+1). Conversely, the descriptors of infrequently accessed pages are demoted if their expiration value is reached, i.e., they are not accessed again before Timeout requests to other pages. Whenever a descriptor is promoted or demoted, its reference count and expiration value are reset. To avoid performance degradation, updates to these queues are performed by the memory controller off the critical path of memory accesses, using a separate queue for updates and a small on-chip SRAM cache.
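
A software sketch of this bookkeeping is given below, assuming eight queues and an illustrative Timeout; the real mechanism lives in the memory controller with a small SRAM cache and performs these updates off the critical path. The demotion scan here is simplified for clarity.

```python
from collections import OrderedDict

# Sketch of the MQ-style access-frequency tracker described above. Promotion
# happens when a descriptor's reference count reaches 2^(i+1); demotion happens
# when its expiration value is reached. Queue count and Timeout are illustrative.
class MultiQueue:
    def __init__(self, num_queues=8, timeout=1000):
        self.queues = [OrderedDict() for _ in range(num_queues)]  # LRU order per level
        self.level = {}                   # page -> queue index
        self.timeout = timeout
        self.current = 0                  # running count of all DRAM accesses

    def access(self, page):
        self.current += 1
        if page not in self.level:
            self.level[page] = 0
            self.queues[0][page] = {"refs": 0, "expire": self.current + self.timeout}
        i = self.level[page]
        desc = self.queues[i].pop(page)   # will re-insert at LRU tail
        desc["refs"] += 1
        desc["expire"] = self.current + self.timeout
        if desc["refs"] >= 2 ** (i + 1) and i < len(self.queues) - 1:
            i += 1                        # promote; reset count and expiration
            desc = {"refs": 0, "expire": self.current + self.timeout}
        self.level[page] = i
        self.queues[i][page] = desc
        self._demote_expired()

    def _demote_expired(self):
        # Simplified scan: demote any descriptor whose expiration value passed.
        for i in range(1, len(self.queues)):
            for page, desc in list(self.queues[i].items()):
                if desc["expire"] <= self.current:
                    self.queues[i].pop(page)
                    self.level[page] = i - 1
                    self.queues[i - 1][page] = {"refs": 0,
                                                "expire": self.current + self.timeout}

    def hottest(self, n):
        """Return up to n pages, most frequently accessed levels first."""
        pages = []
        for q in reversed(self.queues):
            pages.extend(reversed(list(q)))
        return pages[:n]
```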

3.1.4 Migrating Pages between the Hot and Cold Tiers

To ensure that frequently accessed pages are kept in the hot tier, we periodically migrate pages between tiers based on their current access frequency. We employ an epoch-based scheme wherein the hypervisor uses page access information to identify pages that should be migrated between the tiers. Using this access information at the end of every epoch, a migration daemon moves pages between the tiers such that the most frequently accessed pages among both the tiers are placed in the hot tier for the subsequent epoch, pushing out less frequently accessed pages to the cold tier if needed.

To preserve data coherence, the hypervisor first protects any pages being migrated by restricting accesses to them for the duration of the migration and updates the corresponding page table entries. Once the migration is complete, the hypervisor shoots down the stale TLB entries for the migrated pages and flushes any dirty cache lines belonging to those pages. The first subsequent access to a migrated page will trigger a TLB fault and the first subsequent access to each cache line in a migrated page will suffer a cache miss, both of which we faithfully model in our simulations. A hardware block copy mechanism could reduce much of the overhead of migration, but we do not assume such a mechanism is available.

An important consideration is how long an epoch to employ. Long epochs better amortize migration overhead, but may miss program phase changes and suffer when suddenly “hot” data are kept in the cold tier too long. Another consideration is whether to limit the amount of migration per epoch. We found that a 40M-cycle epoch length and a limit of 500 (4 KB) page migrations per epoch achieved a good balance. We discuss the impact of these choices (epoch length and per-epoch migration cap) in Section 4.
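
A sketch of the end-of-epoch decision is given below. It assumes per-page access counts for the epoch are available (e.g., from the MQ tracker) and a per-epoch migration cap; the function name and data structures are illustrative assumptions, not part of the hypervisor implementation.

```python
# End-of-epoch migration planning: place the most frequently accessed pages in
# the hot tier, pushing out colder pages if needed, subject to a per-epoch cap
# (500 pages in our experiments). Structures are illustrative.
def plan_migrations(access_counts, hot_pages, hot_capacity, max_migrations=500):
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    desired_hot = set(ranked[:hot_capacity])
    promote = [p for p in ranked
               if p in desired_hot and p not in hot_pages][:max_migrations]
    # Demote just enough of the no-longer-hot pages to make room.
    overflow = max(0, len(hot_pages) + len(promote) - hot_capacity)
    demote = [p for p in hot_pages if p not in desired_hot][:overflow]
    return promote, demote
```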

4 RESULTS

4.1 Evaluation Metric

Our focus is on power- and memory-constrained VM consolidation scenarios, which are increasingly common in data centers. We assume that all other resources are plentiful and do not impact VM performance. Memory bandwidth, CPU resources, and other shared resources are underutilized in typical consolidation scenarios; e.g., Kozyrakis et al. found that large online services consume less than 10 percent of provisioned bandwidth, but close to 90 percent of memory capacity [22]. Thus, in our analysis, we assume that power limits how much memory a server can support and the amount of memory available is the main constraint on the number of VMs that can be run effectively on a server.

The goal of our work is to maximize aggregate server performance for a given power budget, which involves balancing the total amount of memory present in a server, which determines how many VMs can be loaded, against the amount of “hot” memory present, which affects per-VM performance. We evaluate iso-power tiered memory on its ability to boost aggregate server performance in memory-constrained virtualized environments by increasing the available memory within a given memory system power budget. We evaluate the effectiveness of iso-powered tiered memory by simulating the performance of individual VMs when they are allocated different amounts of memory, which corresponds to allocating different numbers of VMs on a server with a fixed memory size. We also simulate the impact of memory being divided into hot and cold tiers and the overhead of managing the tiers, both of which impact individual VM performance.

As a baseline, we first measure the memory footprint and performance of a single VM in a virtualized environment. In the following discussion, let X be the memory footprint of a particular VM instance, measured in 4 KB sized pages. If the VM instance is allocated at least X pages, it fits entirely in main memory and never suffers from a page fault. When we run more VMs than can fit entirely in a server’s main memory, we allocate each VM a proportional fraction of memory. For example, if we run two VMs, one with an X-page memory footprint and one with a 2X-page memory footprint, on a server with X pages of memory, we allocate (1/3)X to the first VM and (2/3)X to the second.

We report aggregate system performance when running N VMs as:

Aggregate System Performance = Individual VM Performance × Number of VMs.

Thus, to determine the aggregate system performance for a given tiered memory configuration, we first determine the corresponding per-VM memory capacity and then employ a cycle-accurate simulator to determine the performance of a single VM with this allocation. Note again that we assume that memory capacity is the primary limit on the number of VMs that a server can support. Thus, we simulate VMs running in isolation and do not model contention for cores, memory controllers, or memory channels. This model is representative of typical consolidation scenarios, where the number of VMs ready to execute at any given moment is less than the number of cores and contention for memory controller and memory channel resources does not substantively impact performance. This approach allows us to evaluate points in the design space that are infeasible to simulate on a high fidelity full system simulator, e.g., very large numbers of VMs running on large tiered memory configurations.
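
The metric can be summarized by the short sketch below, which assumes identical VMs for simplicity; perf_vs_allocation stands in for the per-VM cycle-accurate simulation and is an assumed placeholder, not part of our infrastructure.

```python
# Sketch of the evaluation metric: each VM gets a proportional share of
# physical memory, and aggregate performance is per-VM performance times the
# number of VMs. perf_vs_allocation is an assumed callback returning the
# simulated performance of one VM given its memory allocation in pages.
def aggregate_performance(num_vms, footprint_pages, total_pages, perf_vs_allocation):
    share = min(1.0, total_pages / (num_vms * footprint_pages))
    per_vm_pages = int(footprint_pages * share)     # proportional allocation
    return perf_vs_allocation(per_vm_pages) * num_vms
```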

To simplify the discussion and enable easier cross-experiment comparison, our baseline configuration entails


100 VMs running on a server with a nontiered memory of size 100X. The specific value of X is workload dependent but unimportant—it simply denotes a baseline scenario where we allocate as many VMs as fit in the server’s physical memory without paging. All other results are normalized to this baseline configuration.

4.2 Experimental Infrastructure and Methodology

We evaluate our proposal using the Mambo full system simulator [6]. Each VM is assigned a single out-of-order core. We employ a cycle-accurate power-performance model of a high-performance single-channel memory system with fully buffered, DDR3 memory. Details of the core and memory system modeled are listed in Table 1.

We evaluate applications from the PARSEC [5] and SPEC CPU2006 [14] benchmark suites. We use the simlarge inputs for PARSEC and the ref inputs for SPEC. For the SPEC benchmarks, we simulate 5 billion instructions after appropriate warmup. For each PARSEC benchmark, we use the specified Region-of-Interest (ROI) [5].

To determine the baseline performance of each workload, we model just enough memory to hold the entire workload image without paging. Each workload’s pages are distributed uniformly over the available ranks. We model a 4 KB interleave granularity so each page fits within a single rank. To model scenarios with less memory, we simply reduce the simulated number of rows in each DRAM device without altering any other memory system parameters, as explained below. As an example, consider an application with an 80 MB (20,000 4 KB pages) memory footprint. In the baseline configuration (100 VMs running on a server with 100X DRAM capacity), the VM’s 80 MB footprint would be uniformly distributed across eight ranks of active DRAM. To model 125 VMs executing on the same memory configuration, we model the VM running on a system with 64 MB of DRAM (80 MB divided by 1.25) equally distributed across the eight ranks of active DRAM. The VM would suffer page faults whenever it touched a page of memory not currently in its 64 MB DRAM allocation, in which case we model LRU page replacement.

The simulator models every DRAM access to determine if it is to a hot rank, a cold rank, or a page not currently present in memory (thereby causing a page fault). The simulator emulates the tiered memory architecture described in Section 3, including the migration of data between tiers based on usage. Every DRAM rank is associated with a queue of pending requests. Requests to a hot rank are serviced as soon as the channel and rank are free. Requests to a cold rank that is currently in the low-power state are queued, but as described in Section 3, no request is delayed more than tq DRAM cycles. The simulator also models cross-tier data migration. The migration epoch length is configurable, but unless otherwise specified all experiments used a 40M CPU cycle (10 ms) epoch. Sensitivity to epoch length is presented below.

To track which pages are resident in memory, the simulator emulates OS page allocation and replacement. We model random page allocation, which corresponds to situations where the OS free page list is fragmented, as is common on servers that have been executing for more than a trivial amount of time.


TABLE 1. Simulator Parameters

We conservatively model a low (1 msec) page fault service time, e.g., by using an SSD, but our results are not sensitive to higher service times, e.g., due to using a disk paging device, since performance degradation is severe even with a 1 msec service time.

4.3 Evaluation

Using the analytic model described in Section 3.1.1, we identified three candidate iso-power memory configurations:

- Baseline. Eight hot ranks and 0 cold ranks.
- 2X-capacity. Four hot ranks and 12 cold ranks.
- 3X-capacity. Two hot ranks and 22 cold ranks.

As seen in Fig. 3, even with only a 6:1 hot:cold idle power ratio, all three of these configurations are iso-power when at least 70 percent of requests are serviced by the hot tier (α ≥ 0.7). The 2X-capacity configuration is iso-power even down to α = 0.5, while the 3X-capacity configuration needs α > 0.5. For the 2X- and 3X-capacity configurations, the additional ranks on the same channels as the baseline ones will see higher access latency [18], an effect that our simulator models. Recall that the total memory capacity of the baseline model is enough to hold 100 VMs in their entirety; the 2X- and 3X-capacity configurations have enough memory to hold 200 and 300 VMs, respectively, but a large portion of this memory is in self-refresh mode.

Fig. 5 presents the results for one representative workload, libquantum, for each modeled memory configuration. Fig. 5a plots normalized aggregate performance. As expected, baseline performance scales linearly up to 100 VMs, after which performance drops off dramatically due to the high overhead of page faults. The performance of the 2X-capacity configuration scales almost perfectly up to 200 VMs, where it achieves 176 percent of the aggregate performance of the baseline organization, after which it also suffers from high page fault overhead. The 3X-capacity configuration is able to squeeze out a bit more aggregate performance, 188 percent when 300 VMs are loaded.

There are several important trends to note in these results. First, there is a clear drop in the relative benefit of tiering going from the 2X- to 3X-capacity organizations. Fig. 5b illustrates this effect—it plots per-VM performance for the 2X- and 3X-capacity configurations. The dropoff in per-VM performance in the 2X-capacity configuration is modest, never more than 15 percent, because four hot ranks are enough to capture most accesses. In contrast, with only two hot ranks, per-VM performance for the 3X-capacity configuration drops as much as 37 percent at 300 VMs. The impact of slower accesses to cold tiers causes a clear drop in benefit as we move to more aggressive tiering organizations.

Another trend to note in Fig. 5 is the gap between the baseline and 3X-capacity results at 100 VMs and the gap between the 2X- and 3X-capacity results at 200 VMs. These gaps motivate a tiered memory architecture that adjusts the size of the hot and cold tiers as demand for memory capacity grows and shrinks. Under light load, baseline performance is best and eight ranks should be kept hot, with the remainder turned off. As load increases, the system should shift to the 2X-capacity configuration, which continues to scale well up until 200 VMs are present. Finally, the system should shift to the 3X-capacity configuration under the heaviest loads. Overall system performance with such a design should track the convex hull of the three curves. While we do not simulate the ability to shift aggregate capacity based on demand, it can be realized by using (capacity) page fault misses to trigger increasing memory capacity by adding more cold ranks and suitably reducing the number of hot ranks.

Fig. 6 presents the results for the remaining applications that we evaluated. The relatively smaller gaps between the baseline and 3X-capacity results at 100 VMs indicate that these applications are not as sensitive as libquantum to slower cold tier accesses. Fig. 7 presents the peak aggregate performance for all seven benchmarks for the 2X- and 3X-capacity configurations relative to the baseline design. The aggregate performance of x264, facesim, and fluidanimate scales almost linearly with memory capacity, achieving over 186 percent speedup on the 2X-capacity configuration and over 275 percent on the 3X-capacity configuration.


Fig. 5. Behavior of libquantum on a tiered memory system.

soplex and ferret scale fairly well, achieving over 180 and 250 percent on the 2X- and 3X-capacity configurations, respectively. Finally, canneal and libquantum both scale well on the 2X-capacity configuration, but experience fairly small benefits when going to the 3X-capacity configuration.

To understand why individual applications benefit differently from the tiered organizations, we investigated how they access memory. Fig. 8 presents the memory access characteristics of all seven benchmark programs. Fig. 8a presents the total number of memory accesses per 1,000 instructions, which indicates how memory-bound each application is. Fig. 8b plots the number of cold tier accesses per 1,000 instructions for the 3X-capacity configuration as we vary the number of VMs. These results confirm our theory of why the various benchmarks perform as they do on the various iso-powered tiered memory configurations.


Fig. 6. Aggregate performance for baseline and 3X-capacity configurations.

The two applications whose performance scales the least, canneal and libquantum, are memory intensive and access the cold tier far more frequently compared to the other applications. In contrast, fluidanimate, facesim, and x264 access the cold tier very rarely even in the 3X-capacity configuration, and thus experience near linear performance increase. In general, the benefit of using very large iso-power memory configurations is dependent on the spatial locality of a given application.

In addition to slower accesses to the cold tier, another potential overhead in our tiered memory architecture is the cost of migrating pages between tiers after each epoch. Short epochs can respond to application phase changes more rapidly, and thus reduce the rate of cold tier accesses, but the overhead of frequent migrations may overwhelm the benefit. To better understand the tradeoffs, we re-ran our experiments with four different epoch lengths:

1. 20M CPU cycles,
2. 40M CPU cycles,
3. 50M CPU cycles, and
4. 100M CPU cycles—recall that our baseline evaluation described earlier used a 40M-cycle epoch.

We found that while there was minor variation in performance for individual applications (e.g., soplex, canneal, and facesim performed best with 20M-cycle epochs, while ferret, fluidanimate, and x264 performed best with 100M-cycle epochs), the performance variance for all applications across all epoch lengths was less than 5 percent. Thus, we can conclude that for reasonable choices of epoch length, performance is insensitive to epoch length and there is no need to adjust it dynamically.

For our final sensitivity study, we evaluated the impact of limiting the maximum number of migrations per epoch.


Fig. 7. Peak aggregate performance for 2X- and 3X-capacity memory configurations.

Fig. 8. Memory behavior of benchmark applications.

Our baseline results, reported earlier, limited the number of migrations per VM to 500 per epoch, which corresponds to a maximum of 1 percent runtime overhead for migration given the available memory bandwidth. Fig. 9 presents normalized performance for four applications on the 3X-capacity configuration for two other migration limits: 1) 100 migrations per VM per epoch (20 percent of the original limit) and 2) UNLIMITED migrations. In general, performance is insensitive to any limits on migrations, with the exception of libquantum when migrations were constrained to only 100 per VM per epoch. Consequently, we can conclude that limiting migrations per epoch to limit overhead has little impact on performance, and it is better to allow many migrations than to try to artificially limit them.

5 DISCUSSION

This section elaborates on some of the important experimental and implementation aspects of our proposed solution.

5.1 Contention for Shared Resources among VMs

Our proposed iso-power tiered memory architecture does not consider or address contention among VMs for shared resources other than main memory capacity. In most server consolidation scenarios, memory capacity is the primary factor limiting the number of VMs that a particular server can host, not memory bandwidth or compute capability, and memory capacity is increasingly limited by power. We evaluated the memory bandwidth needs of our workloads, indirectly illustrated in Fig. 8a. In all cases, the memory bandwidth needs of these applications are significantly below the per-core bandwidth available on mid to high-end servers, which are designed to run very memory-intensive workloads. When we modeled multiple VMs running in parallel sharing a single memory channel, the performance impact of contention for the channel (and associated memory controller) was negligible.

When memory bandwidth is a bottleneck, e.g., when the VMs being consolidated are running memory-intensive applications, the performance impact of increased memory bandwidth contention and iso-powered tiered memory latency overheads will tend to occur at the same time. As can be seen in Figs. 7 and 8a, the workloads with the highest memory bandwidth requirements, libquantum and canneal, experience the least benefit from tiering, at least for the 3X-capacity configuration. These kinds of memory-intensive applications are the same ones that will suffer (or cause) the most memory bandwidth contention problems in current consolidation scenarios, so a smart consolidation platform will already consolidate these workloads less aggressively. In other words, a memory bandwidth-aware VM placement mechanism will naturally do a good job of iso-powered tiered memory VM placement.

Another factor that we do not address is limited compute capacity. Our models assume that if there is enough memory to hold N VMs, then there is sufficient compute capacity to execute them. In our experience, this is almost always the case for modern servers, at least for commercial applications that tend to have large memory footprints. As discussed earlier, per-processor core counts are growing much faster than memory capacity is, so memory capacity (power) is going to become an even more important factor limiting performance in the future. Nevertheless, for situations where single-threaded performance is more important than aggregate performance, an iso-power tiered memory system may want to be less aggressive about using the (slow) cold ranks.

5.2 Iso-Power Validation

Iso-power tiered memory organizations are guaranteed to fit under the same maximum power cap as the baseline design. This power cap is determined by the physical power delivery constraint to the system. The specific power consumed by each application on each configuration will vary, but will never exceed the baseline memory power budget, regardless of memory access pattern and migration frequency. The baseline power cap is a server-specific value that depends on the total DRAM capacity, which determines the idle-state power consumption of DRAM, and the maximum rate of memory requests, which determines the additional active power component.

Our iso-powered tiered architecture does not change the maximum bandwidth available, so the active power component of a particular tiered organization cannot exceed that of a worst case memory-intensive application running on the baseline configuration, regardless of hot-cold access patterns or migration frequency. Migration bandwidth competes with demand request bandwidth (from different cores), and thus cannot cause the total bandwidth consumed by a channel to exceed its limit. We should note that even for the most migration-intensive combination of application and tiered memory configuration, migration bandwidth was only 200 MB/s, which is roughly 1 percent of available channel bandwidth (19.2 GB/s). Thus, both the power and performance impact of migration is very small.

The static power component of tiered memory power is determined by the configuration. We choose iso-power configurations using the analytical model derived in Section 3.1.1. The power equivalence of a particular tiered configuration and the baseline design derives from the ratio of hot:cold idle power (Pratio) and the hit rate of accesses to the hot tier (α). Pratio is a constant for a given system, while α can be measured, and the impact of the rare situation of a poor α on power can be controlled via request throttling.


Fig. 9. Impact of limiting number of migrations per epoch.

The configurations identified by our analytical model are conservative, but a real implementation of iso-power tiered memory would likely include hardware-firmware mechanisms to enforce the power cap despite unexpected changes in DRAM device efficiency, memory access behavior, and ambient temperature.

5.3 Supporting Additional Memory

The goal of our iso-powered tiered memory architecture is to increase memory capacity within a fixed power constraint. DRAM ranks can be added to an existing on-board DRAM channel with FB-DIMMs. Memory can also be added in the form of memory extensions (e.g., IBM eX5) and dedicated memory extension enclosures (e.g., IBM MAX5). One could also consider enabling more channels, but this would require modifying the processor design to incorporate more memory controllers and allocate additional I/O pins. All such approaches could incorporate variants of our proposed tiered memory design.

An alternative way to increase memory capacity at a fixed power budget is to use load-reduced DRAM DIMMs (LR-DIMMs) in place of traditional DDR3 DIMMs [21]. We consider this to be an appealing approach, but it is complementary to our proposal since one could build iso-powered tiered memory from LR-DIMMs to further increase the memory capacity for a given power budget.

6 RELATED WORK

Although there were some notions of memory tiering in very early computing systems, the modern development of multilevel main memory appears to start with the work of Ekman and Stenstrom [10], [11]. They propose a two-tier scheme similar to ours and make the same observation that only a small fraction of data is frequently accessed. They manage the hot and cold tiers by using the TLB to mark cold pages as inaccessible—when a cold page is accessed, a page fault occurs which causes the OS to copy the accessed data to the hot tier. In contrast, our scheme never experiences a page fault for data in the cold tier. Instead, we periodically reorganize data based on access frequency to bring critical data into the hot tier. Their work focuses on reducing the data transfer overhead of moving data from the slow tier to the faster tier during a page fault. Our approach requires (modest) additional hardware support to track page access frequency and manage the tiers, but in return is able to provide faster access to the cold tier and a more gradual drop in performance when application footprints exceed the hot tier size. Also, Ekman and Stenstrom do not attempt to maintain an iso-power guarantee with their memory implementation.

Liu et al. [24] propose mechanisms to reduce idle DRAM energy consumption. They employ refresh rates lower than recommended by the manufacturer, which can cause data loss. They propose techniques to identify the critical data of an application by introducing new programming language constructs along with OS and runtime support. Liu et al. claim that applications can cope with data loss for noncritical data. This approach to reducing memory power is orthogonal to ours, even though both entail introducing heterogeneity into the DRAM system. We employ heterogeneity to provide a power-performance tradeoff, while they trade off power against data integrity.

Jang et al. [19] propose methods to reduce main memory energy consumption using VM scheduling algorithms in a consolidation environment. They propose to intelligently schedule VMs that access a given set of DRAM ranks to minimize the number of ranks in Active-Standby mode. We believe this approach is orthogonal to our proposal and can be combined with our scheme to provide increased aggregate memory power savings.

Lebeck et al. [23] propose dynamically powering down infrequently used parts of main memory. They detect clusters of infrequently used pages and map them to powered down regions of DRAM. Our proposal provides a similar mechanism while guaranteeing we stay under a fixed memory power budget.

Ahn et al. [3] proposed an energy-efficient memory module, called a Multicore DIMM, built by grouping DRAM devices into multiple "virtual memory devices" with fine-grained control to reduce active DRAM power at the cost of some serialization overhead. This approach is orthogonal to our proposal.

Huang et al. proposed Power-Aware Virtual Memory [15] to reduce memory power using purely software techniques. They control the power state of DRAM devices directly from software. Their scheme reconfigures page allocation dynamically to yield additional energy savings. They subsequently proposed extensions that incorporate hardware-software cooperation [16], [17] to further improve energy efficiency. Their mechanisms reduce the power consumption of a given memory configuration, whereas we use low-power DRAM modes to enable greater memory capacity in a fixed power budget.

Waldspurger [27] introduced several techniques, including page sharing, reclaiming idle memory, and dynamic reallocation, to stretch the amount of memory available and enable more virtual machines to be run on a given system. The Difference Engine approach by Gupta et al. [12] and Satori by Murray et al. [25] tackle the same problem of managing VM memory size. These proposals are complementary to our approach, and leveraging them would lead to even higher aggregate numbers of VMs per server.

Ye et al. [29] present a methodology for analyzing the performance of a single virtual machine using hybrid memory. Using a custom virtualization platform, they study the performance impact of slower, second-level memory on applications. Their motivation is to leverage technologies like flash storage to reduce disk accesses. In contrast, our scheme aims to reduce the aggregate impact of tiering by managing data at a finer granularity while strictly conforming to a fixed power budget.

Heterogeneous DRAM-based memory designs were recently studied by Dong et al. [9] to improve total memory capacity. They propose using system-in-package and 3D stacking to build a heterogeneous main memory. They also move data between memory tiers using hardware- or OS-assisted migration. The heterogeneity in their design is along the dimensions of bandwidth and capacity. They focus on application performance and attempt to provide the lowest access latency and highest bandwidth to frequently accessed data. We also strive to provide the lowest access latency to frequently accessed data, but our work differs from theirs in that we focus on the power budget and leverage existing low-power DRAM modes rather than emerging packaging technologies.

Flash-based hybrid memory schemes have also been researched extensively in recent years. Badam and Pai [4] developed a scheme to use flash memory in a flat address space along with DRAM while introducing minimal modifications to user applications. Their scheme allows building systems with large memory capacity while being agnostic to where in the system architecture the flash capacity resides (PCIe bus, SATA interface, etc.). The performance characteristics of flash are quite different from those of DRAM placed in a low-power mode, so the architecture and performance tradeoffs of the two approaches are quite different. Our work could be combined with a flash-based hybrid memory scheme to create a memory architecture with an additional tier.

Finally, researchers have recently proposed replacing traditional DRAM (DDR3) with low-power DRAM chips (LP-DDR2) designed for mobile and embedded environments [21]. LP-DDR2 devices consume less power than server DDR3 DRAM, but they have lower peak bandwidth, and their design severely limits the number of devices that can be supported on a single DRAM channel, both of which run counter to the need to increase memory capacity and bandwidth. Iso-power tiered memory is nonetheless equally applicable to LP-DDR2 main memories, and it may even be more effective there, since LP-DDR2 devices support even deeper low-power states than DDR3.

7 CONCLUSIONS AND FUTURE WORK

The goal of our work is to increase a server's memory capacity for a fixed memory power budget. We present a two-tier iso-power memory architecture that exploits DRAM power modes to support 2X or 3X as much main memory for a constant power budget. In our design, we divide main memory into a relatively small hot tier, with DRAM placed in the standby/active state, and a larger cold tier, with DRAM placed in the self-refresh state. Given the relative idle power of DRAM in the active and self-refresh modes, we are able to provide a range of iso-power configurations that trade off performance (in the form of "hot tier" size) for total capacity. In a server consolidation environment, supporting 3X as much memory, even when much of it must be maintained in a slow low-power state, allows a single server to achieve 2-2.5X the aggregate performance of a server with a traditional memory architecture.
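To make this iso-power tradeoff concrete, the following back-of-the-envelope relation is illustrative only; the symbols and example numbers are not the measured values from our evaluation. Let $n_h$ and $n_c$ be the numbers of ranks in the hot and cold tiers, $P_{act}$ the idle power of a rank held in the standby/active state, and $P_{sr}$ its power in self-refresh. A tiered configuration is iso-power with a conventional system of $n_0$ fully active ranks when

$n_h P_{act} + n_c P_{sr} \le n_0 P_{act}$, i.e., $n_c \le (n_0 - n_h) \, P_{act} / P_{sr}$.

If, for example, self-refresh drew roughly one fifth of the active idle power ($P_{sr} \approx 0.2 P_{act}$) and the hot tier kept half of the baseline ranks ($n_h = n_0/2$), the cold tier could grow to $2.5\,n_0$ ranks, giving $3\times$ the baseline capacity at the same power.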

There are many ways to extend our work. We are investigating nontraditional DRAM organizations that allow finer-grained tradeoffs between power and performance. We are also studying tiered memory designs that use different memory technologies for the tiers, including DRAM in different low-power modes and emerging memory technologies such as PCM and STT-RAM. Finally, we are exploring ways to integrate earlier ideas that leverage system software to guide data placement among tiers.

REFERENCES

[1] Micron DDR3 SDRAM Part MT41J512M4, http://download.micron.com/pdf/datasheets/dram/ddr3/2Gb_DDR3_SDRAM.pdf, 2006.

[2] EPA Report to Congress on Server and Data Center Energy Efficiency, Aug. 2007.

[3] J. Ahn, J. Leverich, R.S. Schreiber, and N. Jouppi, "Multicore DIMM: An Energy Efficient Memory Module with Independently Controlled DRAMs," IEEE Computer Architecture Letters, vol. 8, no. 1, pp. 5-8, Jan. 2008.

[4] A. Badam and V. Pai, "SSDAlloc: Hybrid SSD/RAM Memory Management Made Easy," Proc. Eighth USENIX Conf. Networked Systems Design and Implementation (NSDI), 2011.

[5] C. Bienia et al., "The PARSEC Benchmark Suite: Characterization and Architectural Implications," technical report, Dept. of Computer Science, Princeton Univ., 2008.

[6] P.J. Bohrer, J.L. Peterson, E.N. Elnozahy, R. Rajamony, A. Gheith, R.L. Rockhold, C. Lefurgy, H. Shafi, T. Nakra, R.O. Simpson, E. Speight, K. Sudeep, E.V. Hensbergen, and L. Zhang, "Mambo: A Full System Simulator for the PowerPC Architecture," SIGMETRICS Performance Evaluation Rev., vol. 31, no. 4, pp. 8-12, 2004.

[7] J.B. Carter and K. Rajamani, "Designing Energy-Efficient Servers and Data Centers," Computer, vol. 43, no. 7, pp. 76-78, July 2010.

[8] X. Dong, N. Muralimanohar, N. Jouppi, R. Kaufmann, and Y. Xie, "Leveraging 3D PCRAM Technologies to Reduce Checkpoint Overhead in Future Exascale Systems," Proc. Conf. High Performance Computing Networking, Storage and Analysis (SC), 2009.

[9] X. Dong, X. Wu, G. Sun, Y. Xie, H. Li, and Y. Chen, "Circuit and Microarchitecture Evaluation of 3D Stacking Magnetic RAM (MRAM) as a Universal Memory Replacement," Proc. 45th Ann. Design Automation Conf. (DAC), 2008.

[10] M. Ekman and P. Stenstrom, "A Case for Multi-Level Main Memory," Proc. Workshop Memory Performance Issues: In Conjunction with the 31st Int'l Symp. Computer Architecture (WMPI), 2004.

[11] M. Ekman and P. Stenstrom, "A Cost-Effective Main Memory Organization for Future Servers," Proc. IEEE Int'l Parallel and Distributed Processing Symp. (IPDPS), 2005.

[12] D. Gupta, S. Lee, M. Vrable, S. Savage, A.C. Snoeren, G. Varghese, G.M. Voelker, and A. Vahdat, "Difference Engine: Harnessing Memory Redundancy in Virtual Machines," Proc. Eighth USENIX Conf. Operating Systems Design and Implementation (OSDI), 2008.

[13] H. Hanson and K. Rajamani, "What Computer Architects Need to Know about Memory Throttling," Proc. WEED Conf., 2010.

[14] J.L. Henning, "SPEC CPU2006 Benchmark Descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1-17, 2006.

[15] H. Huang, P. Pillai, and K.G. Shin, "Design and Implementation of Power-Aware Virtual Memory," Proc. USENIX Ann. Technical Conf., 2003.

[16] H. Huang, K. Shin, C. Lefurgy, and T. Keller, "Improving Energy Efficiency by Making DRAM Less Randomly Accessed," Proc. Int'l Symp. Low Power Electronics and Design (ISLPED), 2005.

[17] H. Huang, K. Shin, C. Lefurgy, K. Rajamani, T. Keller, E. Hensbergen, and F. Rawson, "Software-Hardware Cooperative Power Management for Main Memory," Proc. Int'l Conf. Power-Aware Computer Systems, 2005.

[18] B. Jacob, S.W. Ng, and D.T. Wang, Memory Systems: Cache, DRAM, Disk. Elsevier, 2008.

[19] J.W. Jang, M. Jeon, H.S. Kim, H. Jo, J.S. Kim, and S. Maeng, "Energy Reduction in Consolidated Servers through Memory-Aware Virtual Machine Scheduling," IEEE Trans. Computers, vol. 60, no. 4, pp. 552-564, Apr. 2011.

[20] JEDEC, JESD79: Double Data Rate (DDR) SDRAM Specification, JEDEC Solid State Technology Assoc., Virginia, USA, 2003.

[21] C. Kozyrakis, "Memory Management beyond Free()," Proc. Int'l Symp. Memory Management (ISMM '11), 2011.

[22] C. Kozyrakis, A. Kansal, S. Sankar, and K. Vaid, "Server Engineering Insights for Large-Scale Online Services," IEEE Micro, vol. 30, no. 4, pp. 8-19, July/Aug. 2010.

[23] A. Lebeck, X. Fan, H. Zeng, and C. Ellis, "Power Aware Page Allocation," Proc. Ninth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2000.

[24] S. Liu, K. Pattabiraman, T. Moscibroda, and B.G. Zorn, "Flicker: Saving Refresh-Power in Mobile Devices through Critical Data Partitioning," Proc. Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2011.


[25] D.G. Murray, S. Hand, G. Milos, and M.A. Fetterman, "Satori: Enlightened Page Sharing," Proc. USENIX Ann. Technical Conf., 2009.

[26] L. Ramos, E. Gorbatov, and R. Bianchini, "Page Placement in Hybrid Memory Systems," Proc. Int'l Conf. Supercomputing (ICS '11), 2011.

[27] C.A. Waldspurger, "Memory Resource Management in VMware ESX Server," Proc. Fifth Symp. Operating Systems Design and Implementation (OSDI), 2002.

[28] M. Ware, K. Rajamani, M. Floyd, B. Brock, J.C. Rubio, F. Rawson, and J.B. Carter, "Architecting for Power Management: The IBM POWER7 Approach," Proc. 16th IEEE Int'l Symp. High-Performance Computer Architecture (HPCA '10), Jan. 2010.

[29] D. Ye, A. Pavuluri, C.A. Waldspurger, B. Tsang, B. Rychlik, and S. Woo, "Prototyping a Hybrid Main Memory Using a Virtual Machine Monitor," Proc. Int'l Conf. Computer Design (ICCD), 2008.

[30] Y. Zhou, J. Philbin, and K. Li, "The Multi-Queue Replacement Algorithm for Second Level Buffer Caches," Proc. General Track: USENIX Ann. Technical Conf., 2001.

Kshitij Sudan is currently working toward the PhD degree at the University of Utah. His research interests lie in the area of memory subsystem design and data-center platforms. In the memory domain, he works on techniques to close the widening gap between the memory and compute capabilities of modern systems. He primarily focuses on DRAM memories but has also worked on Flash and Phase-Change memory technologies. In the data-center domain, he works on optimizations across the system stack to improve the performance of big-data frameworks like Map-Reduce. He is a student member of the IEEE.

Karthick Rajamani received the BTech degree from the Indian Institute of Technology, Madras, and the PhD degree in electrical and computer engineering from Rice University. He is a research staff member and manager of the Power-Aware Systems Department at the IBM Austin Research Laboratory. He has worked on dynamic power management architectures from embedded systems to clusters and high-end servers, including designs of the IBM Power6 and Power7 processors and systems for IBM EnergyScale. His current interests include energy-efficient memory subsystems, energy-aware systems software, and technologies for joint management of reliability, power, and performance. He is a senior member of the IEEE and the IEEE Computer Society.

Wei Huang received the PhD degree in electrical engineering from the University of Virginia. He is a researcher at the IBM Austin Research Laboratory. His research interests include power-aware processors, systems, and data centers. He has been working on power management research for IBM POWER and z systems. He is a member of the IEEE.

John B. Carter leads the Data Center Networking research group at IBM Research in Austin. Prior to joining IBM Research, he was the associate director of the University of Utah's School of Computing, where he led research projects in the areas of multiprocessor computer architecture, distributed systems, and memory system design (e.g., Impulse and Khazana). He is a senior member of the IEEE and the IEEE Computer Society.

