
Scalable High Performance Main Memory System Using Phase-Change Memory Technology

Moinuddin K. Qureshi Vijayalakshmi Srinivasan Jude A. Rivers

IBM Research, T. J. Watson Research Center, Yorktown Heights, NY 10598

{mkquresh, viji, jarivers}@us.ibm.com

ABSTRACT

The memory subsystem accounts for a significant cost and power budget of a computer system. Current DRAM-based main memory systems are starting to hit the power and cost limit. An alternative memory technology that uses resistance contrast in phase-change materials is being actively investigated in the circuits community. Phase Change Memory (PCM) devices offer more density relative to DRAM, and can help increase main memory capacity of future systems while remaining within the cost and power constraints.

In this paper, we analyze a PCM-based hybrid main memory system using an architecture-level model of PCM. We explore the trade-offs for a main memory system consisting of PCM storage coupled with a small DRAM buffer. Such an architecture has the latency benefits of DRAM and the capacity benefits of PCM. Our evaluations for a baseline system of 16 cores with 8GB DRAM show that, on average, PCM can reduce page faults by 5X and provide a speedup of 3X. As PCM is projected to have limited write endurance, we also propose simple organizational and management solutions for the hybrid memory that reduce the write traffic to PCM, boosting its lifetime from 3 years to 9.7 years.

Categories and Subject Descriptors:

B.3.1 [Semiconductor Memories]: Phase Change Memory

General Terms: Design, Performance, Reliability.

Keywords: Phase Change Memory, Wear Leveling, Endurance, DRAM Caching.

1. INTRODUCTION

Current computer systems consist of several cores on a chip, and sometimes several chips in a system. As the number of cores in the system increases, the number of concurrently running applications (or threads) increases, which in turn increases the combined working set of the system. The memory system must be capable of supporting this growth in the total working set.

For several decades, DRAM has been the building block of the main memories of computer systems. However, with the increasing size of the memory system, a significant portion of the total system power and the total system cost is spent in the memory system. For example, Lefurgy et al. [15] report that as much as 40% of the total system energy is consumed by the main memory subsystem in a mid-range IBM eServer machine. Therefore, technology researchers have been studying new memory technologies that can provide more memory capacity than DRAM while still being competitive in terms of performance, cost, and power.

Two promising technologies that fulfill these criteria are Flash and Phase Change Memory (PCM). Flash is a solid-state technology that stores data using memory cells made of floating-gate transistors. PCM stores data using a phase-change material that can be in one of two physical states: crystalline or amorphous. While both Flash and PCM are much slower than DRAM, they provide superior density relative to DRAM. Therefore, they can be used to provide a much higher capacity for the memory system than DRAM can within the same budget.

Figure 1 shows the typical access latency (in cycles, assuming a 4GHz machine) of different memory technologies, and their relative place in the overall memory hierarchy. Hard disk drive (HDD) latency is typically about four to five orders of magnitude higher than DRAM [6]. A technology denser than DRAM with access latency between DRAM and hard disk can bridge this speed gap. Flash-based disk caches have already been proposed to bridge the gap between DRAM and hard disk, and to reduce the power consumed in HDD [13]. However, with Flash being about 2^8 times slower than DRAM, it is still important to increase DRAM capacity to reduce the accesses to the Flash-based disk cache. The access latency of PCM is much closer to DRAM, and coupled with its density advantage, PCM is an attractive technology to increase memory capacity while remaining within the system cost and power budget. Furthermore, PCM cells can sustain 1000x more writes than Flash cells, which puts the lifetime of a PCM-based memory system in the range of years, as opposed to days for a Flash-based main memory system.

There are several challenges to overcome before PCM can become a part of the main memory system. First, because PCM is much slower than DRAM, a memory system consisting exclusively of PCM would have a much higher memory access latency, adversely impacting system performance. Second, PCM devices are likely to sustain a significantly smaller number of writes than DRAM, so the write traffic to these devices must be reduced. Otherwise, the short lifetime may significantly limit the usefulness of PCM for commercial systems.


[Figure 1: Latency of different technologies in the memory hierarchy: L1 cache (SRAM), last-level cache (EDRAM), main memory system (DRAM, PCM), and high-performance disk system (Flash, hard drive). Typical access latency is shown in processor cycles for a 4 GHz processor, ranging from a few cycles for the L1 cache to about 2^23 cycles for the hard drive. Numbers accurate within a factor of two.]

There is active research on PCM, and several PCM prototypes have been proposed, each optimizing for some important device characteristics (such as density, latency, bandwidth, or lifetime). While the PCM technology matures, and becomes ready to be used as a complement to DRAM, we believe that system architecture solutions can be explored to make these memories part of the main memory to improve system performance. The objective of this paper is to study the design trade-offs in integrating the most promising emerging memory technology, PCM, into the main memory system.

To be independent of the choice of a specific PCM prototype, we use an abstract memory model that is D times denser than DRAM and S times slower than DRAM. We show that for currently projected values of PCM (S ≈ 4, D ≈ 4), a main memory system using PCM can reduce page faults by 5X, and hence execute applications with much larger working sets. However, because PCM is slower than DRAM, main memory access time is likely to increase linearly with S, which increases the overall execution time. Therefore, we believe that PCM is unlikely to be a drop-in replacement for DRAM. We show that by having a small DRAM buffer in front of the PCM memory, we can make the effective access time and performance closer to a DRAM memory.

We study the design issues in such a hybrid memory architecture and show how a two-level memory system can be managed. Our evaluations for a baseline system of 16 cores with 8GB DRAM show that PCM-based hybrid memory can provide a speedup of 3X while incurring only 13% area overhead. The speedup is within 10% of an expensive DRAM-only system, which would incur 4X the area. We use an aggressive baseline that already has a large Flash-based disk cache. We show that PCM-based hybrid memory provides much higher performance benefits for a system without Flash or with limited Flash capacity.

As each cell in PCM can endure only a limited number of writes, we also discuss techniques to reduce write traffic to PCM. We develop an analytical model to study the impact of write traffic on the lifetime of PCM that shows how the "bytes per cycle" relates to the average lifetime of PCM for a given endurance (maximum number of writes per cell). We show that architectural choices and simple enhancements can reduce the write traffic by 3X, which can increase the average lifetime from 3 years to 9.7 years.

To our knowledge, this is the first study on architectural analysis of PCM-based main memory systems. We believe this will serve as a starting point for system architects to address the challenges posed by PCM, making PCM attractive to be integrated in the main memory of future systems.

2. BACKGROUND AND MOTIVATION

With the increasing number of processors in the computer system, the pressure on the memory system to satisfy the demand of all concurrently executing applications (threads) has increased as well. Furthermore, critical computing applications are becoming more data-centric than compute-centric [9]. One of the major challenges in the design of large-scale, high-performance computer systems is maintaining the performance growth rate of the system memory. Typically, the disk is five orders of magnitude slower than the rest of the system [6], making frequent misses in system main memory a major bottleneck to system performance. Furthermore, main memory consisting entirely of DRAM is already hitting the power and cost limits [15]. Exploiting emerging memory technologies, such as Phase-Change Memory (PCM) and Flash, becomes crucial to building larger capacity memory systems in the future while remaining within the overall system cost and power budgets. In this section, we first present a brief description of the Phase-Change Memory technology, and highlight the strengths of PCM that make it a promising candidate for main memory of high-performance servers. We then present a simple model that is useful in describing such emerging memory technologies for use in computer architecture studies.

2.1 What is Phase-Change Memory?

PCM is a type of non-volatile memory that exploits the property of chalcogenide glass to switch between two states, amorphous and crystalline, with the application of heat using electrical pulses. The phase-change material can be switched from one phase to another reliably, quickly, and a large number of times. The amorphous phase has low optical reflectivity and high electrical resistivity, whereas the crystalline phase (or phases) has high reflectivity and low resistance. The difference in resistance between the two states is typically about five orders of magnitude [24] and can be used to infer logical states of binary data.

While the principle of using phase-change materials for memory cells was demonstrated in the 1960s [23], the technology was too slow to be of practical use. However, the discovery of fast crystallizing materials such as Ge2Sb2Te5 (GST) [27] and Ag- and In-doped Sb2Te (AIST) [25] has renewed industrial interest in PCM, and the first commercial PCM products are about to enter the market. Both GST and AIST can crystallize in less than 100ns, compared to 10µs or more for earlier materials [23]. PCM devices with extremely small dimensions, as low as 3nm × 20nm, have been fabricated and tested. A good discussion on scaling characteristics of PCM is available in [24].


[Figure 2: Typical Phase-Change Memory Device (components labeled: bitline, wordline, access device, phase-change material, contact, insulator).]

Figure 2 shows the basic structure of a PCM device [24]. The PCM material is between a top and a bottom electrode, with a heating element that extends from the bottom electrode and establishes contact with the PCM material. When current is injected into the junction of the material and the heating element, it induces the phase change. Crystallizing the phase-change material by heating it above the crystallization temperature (but below the melting temperature) is called the SET operation. The SET operation is controlled by moderate-power, long-duration electrical pulses; it returns the cell to a low-resistance state and logically stores a 1. Melt-quenching the material is called the RESET operation, and it makes the material amorphous. The RESET operation is controlled by high-power pulses which place the memory cell in a high-resistance state. The data stored in the cell is retrieved by sensing the device resistance using very low power. The current-voltage curve for PCM is shown in Figure 3 [24].

One of the critical properties that enables PCM is threshold switching [24]. To convert from the crystalline to the amorphous state, a very high voltage is required to deliver the high power. However, above a particular threshold voltage, the conductivity of the material in the amorphous state increases rapidly, generating a large current flow that heats the material. If the current pulse is switched off as soon as the threshold voltage is reached, the material returns to the high-resistance, amorphous state.

[Figure 3: Typical I-V curve for a PCM device, demonstrating threshold switching (a.u. = arbitrary units). The curve marks the read voltage, the threshold voltage, and the SET and RESET regions.]

2.2 Why PCM for Main Memory?

PCM is a dense technology with feature size comparable to DRAM cells. Furthermore, a PCM cell can be in different degrees of partial crystallization, thereby enabling more than one bit to be stored in each cell. Recently, a prototype [4] with two logical bits in each physical cell has been demonstrated. This means four states with different degrees of partial crystallization are possible, which allows twice as many bits to be stored in the same physical area. Research is under way to store more bits per PCM cell.

Table 1 summarizes the properties of different memory technologies based on data obtained from the literature [11][12][4][22][10][5][26][1]. Write endurance is the maximum number of writes for each cell. Data retention is the duration for which the non-volatile technologies can retain data. As PCM is still being researched, different sources quote different relative merits for their PCM prototypes. Here we use values of density, speed, and endurance for PCM that summarize the different prototype projections, and this will serve as a reference point for a PCM design.

Table 1: Comparison of Memory Technologies

Parameter    | DRAM    | NAND Flash | NOR Flash | PCM
Density      | 1X      | 4X         | 0.25X     | 2X-4X
Read Latency | 60 ns   | 25 us      | 300 ns    | 200-300 ns
Write Speed  | ≈1 Gbps | 2.4 MB/s   | 0.5 MB/s  | ≈100 MB/s
Endurance    | N/A     | 10^4       | 10^4      | 10^6 to 10^8
Retention    | Refresh | 10 yrs     | 10 yrs    | 10 yrs

From Table 1, it is obvious that the poor density of NOR Flash makes it uncompetitive with DRAM for designing main memory. NAND Flash has much better density than DRAM but has significantly slower random access time (≈25us). The 400x latency difference between DRAM and Flash means that DRAM capacity will have to continue increasing in future main memories. Furthermore, the poor write endurance of Flash would result in an unacceptably low lifetime (in days) for a main memory system. However, Flash has much better latency and power than HDD, and is finding widespread use as a disk cache [13]. In our studies, we assume that the disk system already has a large Flash-based disk cache. However, we show that even with such an aggressive disk system, main memory capacity is still a problem.

Among the emerging memory technologies, PCM has the most promising characteristics. PCM offers a density advantage similar to NAND Flash, which means more main memory capacity for the same chip area. The read latency of PCM is similar to NOR Flash, which is only about 4X slower than DRAM. The write latency of PCM is about an order of magnitude slower than its read latency. However, write latency is typically not in the critical path and can be tolerated using buffers. Finally, PCM is also expected to have higher write endurance (10^6 to 10^8 writes) relative to Flash (10^4 writes). We believe that these capabilities of PCM, combined with lower memory cost, make PCM a promising candidate for main memory of high-performance servers.

2.3 Abstract Memory Model

The focus of this work is to study the effect on overall system performance of adding PCM as a complement to the DRAM memory. In order not to restrict our evaluation of PCM to currently available prototypes, we adopt the approach of a generic memory model that combines existing DRAM memory with a PCM memory. Therefore, in order to study the impact of adding PCM to main memory, we first develop a simple abstract model using DRAM as a reference.


[Figure 4: Candidate main memory organizations: (a) traditional system with DRAM main memory backed by disk, (b) aggressive system with a Flash-based disk cache, (c) system with PCM main memory, (d) system with a hybrid (DRAM buffer + PCM) main memory.]

The PCM technology is described as {D, S, Wmax}, which means the PCM is D times denser than DRAM, has S times slower read latency than DRAM, and can endure a maximum of Wmax writes per cell. For the purposes of our evaluation, we use currently projected values¹ for PCM technology and describe it as {D = 4, S = 4, Wmax = 10 million}.
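To make the abstract model concrete, the following minimal sketch (our own illustration, not code from the paper; the class and function names are hypothetical) shows how the {D, S, Wmax} parameters translate into the values used in our evaluation, taking the baseline 8GB DRAM with 320-cycle access latency (Section 4.2) as the reference.

```python
# Illustrative sketch of the abstract PCM model {D, S, Wmax}; names are ours.
from dataclasses import dataclass

@dataclass
class AbstractPCM:
    d: float      # density relative to DRAM
    s: float      # read latency relative to DRAM
    w_max: int    # maximum writes per cell (endurance)

    def capacity_gb(self, dram_capacity_gb: float) -> float:
        # Same area budget as the reference DRAM, D times the capacity.
        return self.d * dram_capacity_gb

    def read_latency_cycles(self, dram_latency_cycles: int) -> int:
        # Read latency is S times the DRAM latency.
        return int(self.s * dram_latency_cycles)

pcm = AbstractPCM(d=4, s=4, w_max=10_000_000)
print(pcm.capacity_gb(8))            # 32 GB of PCM for the 8GB DRAM reference
print(pcm.read_latency_cycles(320))  # 1280 cycles (4X the 320-cycle DRAM access)
```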

Using PCM with S = 4 as a replacement for DRAM significantly increases the memory access latency, which can have an adverse impact on system performance. Therefore, we explore the trade-offs of using a combination of DRAM and PCM as part of the main memory. Furthermore, the write endurance of PCM, although orders of magnitude better than Flash, can still limit main memory lifetime. The next section describes the organization and system performance challenges of such a hybrid main memory system comprising different technologies.

3. HYBRID MAIN MEMORY

Figure 4(a) shows a traditional system in which DRAM main memory is backed by a disk. Flash memory is finding widespread use to reduce the latency and power requirement of disks. In fact, some systems have only Flash-based storage without hard disks; for example, the MacBook Air [3] laptop has DRAM backed by a 64GB Flash drive. It is therefore reasonable to expect future high-performance systems to have Flash-based disk caches [13], as shown in Figure 4(b). However, because there is still a two-orders-of-magnitude difference between the access latency of DRAM and the next level of storage, a large amount of DRAM main memory is still needed to avoid going to the disks. PCM can be used instead of DRAM to increase main memory capacity, as shown in Figure 4(c). However, the relatively higher latency of PCM compared to DRAM would significantly decrease system performance. Therefore, to get the best capacity and latency, Figure 4(d) shows the hybrid system we foresee emerging for future high-performance systems. The larger PCM storage will have the capacity to hold most of the pages needed during program execution, thereby reducing disk accesses due to paging. The fast DRAM memory will act both as a buffer for main memory and as an interface between the PCM main memory and the processor system. We show that a relatively small DRAM buffer (3% the size of the PCM storage) can bridge most of the latency gap between DRAM and PCM.

¹ PCM cells are smaller (4.8 F^2 per cell) than DRAM cells (6-8 F^2 per cell) [2]. Each PCM cell can store multiple bits [4]. PCM is expected to scale better than DRAM [9]. Therefore, we assume PCM has 4X density compared to DRAM.

3.1 Hybrid Main Memory Organization

In a hybrid main memory organization, the PCM storage is managed by the Operating System (OS) using a page table, in a manner similar to current DRAM main memory systems. The DRAM buffer is organized similarly to a hardware cache that is not visible to the OS, and is managed by the DRAM controller. Although the DRAM buffer can be organized at any granularity, we assume that both the DRAM buffer and the PCM storage are organized at a page granularity.

[Figure 5: Lazy Write Organization. Data fetched from Flash/disk is placed in the DRAM buffer, whose tag entries hold a valid bit (V), the tag, dirty bit(s) (D), a present-in-PCM bit (P), and LRU bits (R). On eviction, a page is sent to the PCM write queue if P=0 or D=1; each PCM page also carries a WearLevelShift (W) field.]

Figure 5 shows our proposed hybrid main memory organization. In addition to covering the larger read latency of PCM using a DRAM buffer, this organization tolerates the order-of-magnitude slower write latency of PCM using a write queue, and overcomes the endurance limit of PCM using techniques that limit the number of writes to the PCM memory. These mechanisms are discussed in detail in the next few sections, along with a description of the relevant portions of Figure 5.

3.2 Lazy-Write Organization

We propose the Lazy-Write organization, which reduces the number of writes to the PCM and overcomes the slow write speed of the PCM, both without incurring any performance overhead. When a page fault is serviced, the page fetched from the hard disk (HDD) is written only to the DRAM cache. Although allocating a page table entry at the time of page fetch from HDD automatically allocates space for this page in the PCM, the allocated PCM page is not written with the data brought from the HDD. This eliminates the overhead of writing the PCM. To track the pages present only in the DRAM, and not in the PCM, the DRAM tag directory is extended with a "presence" (P) bit. When the page from HDD is stored in the DRAM cache, the P bit in the DRAM tag directory is set to 0. In the lazy-write organization, a page is written to the PCM only when it is evicted from the DRAM storage and either its P bit is 0 or its dirty bit is set. If, on a DRAM miss, the page is fetched from the PCM, then the P bit in the DRAM tag directory entry of that page is set to 1. When a page with the P bit set is evicted from the DRAM, it is not written back to the PCM unless it is dirty. Furthermore, to account for the larger write latency of the PCM, a write queue is associated with the PCM. We assume that the tags of both the write queue and the DRAM buffer are made of SRAM so that these structures can be probed with low latency. Given the PCM write latency, a write queue of 100 pages is sufficient to avoid stalls due to the write queue being full.

Compared to an architecture that installs pages in both storages (PCM and DRAM), the lazy-write architecture avoids the first write in the case of dirty pages. For example, consider the kernel of the DAXPY application, Y[i] = a · X[i] + Y[i]. In this case, the lazy-write policy fetches the page(s) for array Y only into DRAM. The PCM gets a copy of the pages of Y only after they have been read, modified, and evicted from DRAM. If, on the other hand, the page(s) for Y were installed in the PCM at fetch time, then they would have been written twice: once on fetch and again on writeback at the time of eviction from DRAM. The number of PCM writes for the read-only pages of X remains unchanged in both configurations.
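The eviction rule can be summarized in a few lines. The sketch below is our own illustration of the policy described above (not the authors' simulator code); the data structure and function names are hypothetical.

```python
# Sketch of the Lazy-Write eviction rule: write a page to PCM on eviction from
# the DRAM buffer only if it is not yet present in PCM (P=0) or it is dirty.
from dataclasses import dataclass

@dataclass
class DramPage:
    tag: int
    present_in_pcm: bool   # P bit: PCM already holds a copy of this page
    dirty: bool            # page was modified while in the DRAM buffer

def on_dram_eviction(page: DramPage, pcm_write_queue: list) -> None:
    if not page.present_in_pcm or page.dirty:
        pcm_write_queue.append(page.tag)   # enqueue the (lazy) PCM write
    # Otherwise the PCM copy is already up to date and no write is issued.

# A clean page that was fetched from PCM (P=1, not dirty) causes no PCM write.
queue: list = []
on_dram_eviction(DramPage(tag=0x42, present_in_pcm=True, dirty=False), queue)
assert queue == []
```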

3.3 Line-Level Writes

Typically, the main memory is read and written in pages. However, the endurance limits of PCM require exploring mechanisms to reduce the number of writes to the PCM. We propose writing to the PCM memory in smaller chunks instead of a whole page. For example, if writes to a page can be tracked at the granularity of a processor's cache line, the number of writes to the PCM page can be minimized by writing only "dirty" lines within a page. We propose Line Level WriteBack (LLWB), which tracks the writes to pages held in the DRAM at the granularity of the processor's cache lines. To do so, the DRAM tag directory shown in Figure 5 is extended to hold a "dirty" bit for each cache line in the page. In this organization, when a dirty page is evicted from the DRAM, if the P bit is 1 (i.e., the page is already present in the PCM), only the dirty lines of the page are written to the PCM. When the P bit of a dirty page chosen for eviction is 0, all the lines of the page have to be written to the PCM. LLWB significantly reduces wasteful writes from DRAM to PCM for workloads which write to very few lines in a dirty page. To support LLWB we need one dirty bit per line of a page. For example, for the baseline system with a 4096B page and 256B linesize, we need 16 dirty bits per page in the tag store of the DRAM buffer.
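As an illustration of LLWB (again a sketch of our own, with hypothetical names, assuming the baseline's 4KB pages and 256B lines), the per-line dirty bits determine exactly which lines are queued for the PCM on eviction:

```python
# Sketch of Line-Level WriteBack (LLWB): 16 dirty bits per 4KB page (one per
# 256B line). On eviction of a dirty page, write only the dirty lines if the
# page is already present in PCM (P=1); otherwise the whole page is written.
LINES_PER_PAGE = 16   # 4096B page / 256B line

def lines_to_write(present_in_pcm: bool, line_dirty_bits: int) -> list:
    if not present_in_pcm:
        return list(range(LINES_PER_PAGE))       # whole page goes to PCM
    return [i for i in range(LINES_PER_PAGE)
            if (line_dirty_bits >> i) & 1]       # only the dirty lines

# A page already in PCM with only lines 0 and 3 dirty needs 2 line writes
# instead of 16.
print(lines_to_write(True, 0b0000000000001001))  # [0, 3]
```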

3.4 Fine-Grained Wear-Leveling for PCM

Memories with limited endurance typically employ wear-leveling algorithms to extend their life expectancy. For example, in Flash memories, wear-leveling algorithms arrange data so that sector erasures are distributed more evenly across the Flash cell array and single-sector failures due to a high concentration of erase cycles are minimized. Wear leveling enables the host system to perform its reads and writes to logical sector addresses, while the wear-leveling algorithm remaps logical sector addresses to different physical sector addresses in the Flash array depending on the write traffic. For example, the wear-leveling algorithm of the TrueFFS [16] file system tracks the least-used sectors in the Flash to identify where to write data next.

We also assume that the baseline PCM system uses a wear-leveling mechanism to ensure uniform usage of different regions of the PCM memory. However, wear-leveling is typically done at a granularity larger than the page size (e.g., 128KB regions) in order to reduce the wear-out tracking overhead [13].

LLWB reduces write traffic to PCM. However, if only some cache lines within a page are written frequently, they will wear out sooner than the other lines in that page. We analyze the distribution of write traffic to each line in a PCM page. Figure 6 shows the total writeback traffic per dirty page for the two database applications, db1 and db2. The average number of writes per line is also shown. The page size is 4KB and the line size is 256B, giving a total of 16 lines per page, numbered from 0 to 15.

[Figure 6: Total write traffic (number of writes per line, in millions) for each of the sixteen lines (0-15) of dirty pages in the baseline system, for db1 and db2, with the per-page average marked.]

For both db1 and db2 there is significant non-uniformity in which lines in the page are written back. For example, in db1, Line 0 is written twice as often as the average, which means Line 0 may suffer an endurance-related failure in half the time. The lifetime of PCM can be increased if the writes can be made uniform across all lines in the page. This could be done by tracking the number of writes on a per-line basis; however, that would incur a huge tracking overhead. We propose a technique, Fine-Grained Wear-Leveling (FGWL), that makes the writes uniform (in the average case) while avoiding per-line storage. In FGWL, the lines in each page are stored in the PCM in a rotated manner. For a system with 16 lines per page, the rotate amount is between 0 and 15 lines. If the rotate value is 0, the page is stored in the traditional manner. If it is 1, then Line 0 of the address space is stored in Line 1 of the physical PCM page, each subsequent line is stored shifted by one, and Line 15 of the address space is stored in Line 0. When a PCM page is read, it is realigned. Pages are written from the write queue to the PCM in the line-shifted format. On a page fault, when the page is fetched from the hard disk, a Pseudo Random Number Generator (PRNG) is consulted to get a random 4-bit rotate value, and this value is stored in the WearLevelShift (W) field associated with the PCM page, as shown in Figure 5. This value remains constant until the page is replaced, at which point the PRNG is consulted again for the new page allocated in the same physical space of the PCM.
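The rotation and realignment can be expressed compactly. The following sketch is our own rendering of the FGWL line rotation under the stated parameters (16 lines per page, 4-bit rotate value); the function names are ours.

```python
# Sketch of FGWL line rotation: with rotate value w, address-space line i is
# stored in physical line (i + w) mod 16; reads undo the rotation.
import random

LINES_PER_PAGE = 16

def rotate_for_store(lines, w):
    # Physical line j holds address-space line (j - w) mod 16.
    return [lines[(j - w) % LINES_PER_PAGE] for j in range(LINES_PER_PAGE)]

def realign_on_read(stored, w):
    # Address-space line i is recovered from physical line (i + w) mod 16.
    return [stored[(i + w) % LINES_PER_PAGE] for i in range(LINES_PER_PAGE)]

w = random.randrange(LINES_PER_PAGE)            # 4-bit WearLevelShift per page
page = [f"line{i}" for i in range(LINES_PER_PAGE)]
assert realign_on_read(rotate_for_store(page, w), w) == page
```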

A PCM page is occupied by different virtual pages at different times and is replaced often (several times an hour). Therefore, over the lifetime of the PCM page (in years) the random rotate value associated with each page will have a uniform distribution over the rotate values 0 through 15, so the average stress on each physical page will be uniform. Figure 7 shows the write traffic per dirty DRAM page with FGWL, for db1 and db2. FGWL makes the write traffic uniform across all the lines in the PCM page, which means the lifetime of a PCM page is determined by the average-case write traffic rather than the worst-case write traffic to a line. Implementing FGWL requires 4 bits of storage per page (4KB).


[Figure 7: Total write traffic (number of writes per line, in millions) for the sixteen lines (0-15) of dirty pages using FGWL, for db1 and db2, with the per-page average marked.]

3.5 Page Level Bypass for Write Filtering

Not all applications benefit from more memory capacity. For example, streaming applications typically access a large amount of data but have poor reuse. Such applications do not benefit from the capacity boost provided by PCM. In fact, storing the pages of such applications only accelerates the endurance-related wear-out of PCM. As PCM serves as the main memory, it is necessary to allocate space in PCM when a page table entry is allocated for a page. However, the actual writing of such pages to the PCM can be avoided by leveraging the lazy-write architecture. We call this Page Level Bypass (PLB). When a page is evicted from DRAM, PLB invalidates the page table entry associated with the page and does not install the page in PCM. We assume that the OS enables or disables PLB for each application using a configuration bit. If the PLB bit is turned on, all pages of that application bypass the PCM storage.
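A sketch of how PLB composes with the lazy-write eviction rule is shown below; this is our own illustration with hypothetical types and names, not the paper's implementation.

```python
# Sketch of Page Level Bypass (PLB): if the OS has set the PLB bit for the
# owning application, an evicted page is never installed in PCM; its page
# table entry is simply invalidated. Otherwise the lazy-write rule applies.
from dataclasses import dataclass

@dataclass
class Page:
    vpn: int               # virtual page number
    present_in_pcm: bool   # P bit
    dirty: bool

def on_eviction_with_plb(page: Page, plb_enabled: bool,
                         page_table: dict, pcm_write_queue: list) -> None:
    if plb_enabled:
        page_table.pop(page.vpn, None)     # invalidate PTE, no PCM write at all
        return
    if not page.present_in_pcm or page.dirty:
        pcm_write_queue.append(page.vpn)   # fall back to the lazy-write rule
```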

4. EXPERIMENTAL METHODOLOGY

4.1 Workloads

We use a diverse set of applications from various suites, including data-mining, numerical computation, Unix utilities, streaming applications, and database workloads. Six of the eight benchmarks are data-parallel applications, each containing sixteen identical threads; these benchmarks are simulated to completion. Benchmarks qsort and bsearch (binary search) are Unix utilities. Gauss is a numerical algorithm used to solve linear equations. Kmeans is a data-mining algorithm used primarily to classify data into multiple clusters; in our studies, kmeans simulates 8 iterations of clustering and re-centering between two cluster centers. Daxpy and vdotp are streaming benchmarks derived from the STREAM suite [19]. These two benchmarks spend more time in array initialization than in computation, so we do not collect statistics in the initialization phase for these two benchmarks. We also use scaled versions of two industry-standard database workloads, derived from a mainframe server. For both database workloads, db1 and db2, we simulate four billion memory references (corresponding to a total transfer of 1 terabyte of data between memory and processors). Table 2 shows the relevant characteristics of the benchmarks used in our studies, including data-set size, page faults per million instructions for the baseline 8GB memory, instructions per thread, and memory accesses per thousand instructions (MAPKI).

4.2 Configuration

We use an in-house system simulator for our studies. The baseline configuration is a sixteen-core system comprising four 4-core CMPs with the parameters given in Table 3. We use a simple in-order processor model so that we can evaluate our proposal for several hundred billion instructions in order to stress the multi-GB main memory. Each core has private L1 and L2 caches, each with a linesize of 256B. We keep the processor parameters constant in our study.

Table 3: Baseline Configuration

System              | Sixteen cores (four quad-core chips)
Processor Core      | single issue, in-order, five-stage pipeline, 4GHz
L1 caches (Private) | I-cache and D-cache: 64 KB, 4-way, 256B line
L2 cache (Private)  | 2MB, 16-way, LRU, 256B linesize
DRAM Memory         | 8GB, 320-cycle access latency, 8 ranks of 8 banks each, bank conflicts modeled
Off-chip Bus        | 16B-wide split-transaction bus, 2:1 speed ratio
Disk Cache (Flash)  | 32µs (128K cycles) avg. latency, 99% hit rate
Hard disk drive     | 2ms (8M cycles) average access latency

The baseline consists of an 8GB DRAM main memory which can be accessed in 320 cycles if there are no bank conflicts. The off-chip bus can transfer a cache line in 32 cycles, and bus queuing delays are modeled. A page size of 4KB is assumed. Virtual-to-physical address translation is performed using a page table built into our simulator. A clock-style algorithm is used to perform page replacements. If there is PCM storage, it is accessible at 4X the latency of DRAM. In the hybrid memory configuration, the DRAM buffer is 16-way with 4KB lines (same as the page size), and is managed using LRU replacement. The interconnect for PCM is assumed to be identical to that between DRAM and the processor.

We also assume that our memory system has a very aggressive Flash-based disk cache (99% hit rate) so that the latency of page faults is reduced considerably. Although such an aggressive configuration diminishes the benefits of a larger-capacity PCM main memory, we expect such efficient Flash caches to be used in future high-performance systems. We discuss systems without a Flash cache in Section 5.4.

Table 2: Benchmark Characteristics (Bn = Billions)

Name    | Description                | Data set for each thread           | Page Faults / Million inst | Inst/thread | MAPKI
qsort   | Quick sort                 | 14M entries, 128B each             | 21.5  | 114Bn  | 1.66
bsearch | Binary search              | 7M entries, 256B each, 14M queries | 2507  | 12Bn   | 20.8
kmeans  | Clustering, data mining    | 256K pts * 1K attrib of 8B each    | 48    | 175Bn  | 0.82
gauss   | Gauss-Seidel method        | matrix 64K x 4K, 8B entry          | 97    | 97Bn   | 1.82
daxpy   | Do Yi = a · Xi + Yi        | X, Y are 100M entries x 8B each    | 156   | 2.6Bn  | 3.75
vdotp   | Vector dot product Xi · Yi | X, Y are 100M entries x 8B each    | 205   | 2Bn    | 3.29
db1     | Database (OLTP)            | -                                  | 39.5  | 23Bn   | 10.86
db2     | Database (web-based)       | -                                  | 119.5 | 22.9Bn | 10.93


5. RESULTS AND ANALYSIS

5.1 Page Faults vs. Size of Main Memory

Figure 8 shows the number of page faults as the size of the main memory is increased from 4GB to 32GB, normalized to the baseline with 8GB. The bar labeled Gmean denotes the geometric mean. Both database benchmarks, db1 and db2, continue to benefit from increasing the size of main memory. For bsearch, there are about 100X more page faults at 8GB than at 32GB. Gauss and kmeans frequently reuse a data-set of greater than 16GB; these benchmarks suffer a large number of page misses unless the memory size is 32GB. Daxpy and vdotp do not reuse pages after the initial accesses, so their number of page faults is independent of main memory size. Overall, a 4X increase (8GB to 32GB) in memory capacity due to the density advantage of PCM reduces the average number of page faults by 5X.

[Figure 8: Number of page faults (normalized to 8GB) as the size of main memory is increased from 4GB to 32GB, shown for each benchmark and the geometric mean (Gmean).]

5.2 Cycles Per Memory Access

PCM can provide 4X more memory capacity than DRAM, but at 4X more latency. If the data-set resident in main memory is frequently reused, the page-fault savings of the PCM system may be offset by the increased memory access penalty on each access. Figure 9 shows the average cycles per memory access for four systems: the baseline 8GB DRAM, 32GB PCM, 32GB DRAM (an expensive alternative), and 32GB PCM with 1GB DRAM. For db2, qsort, bsearch, kmeans, and gauss, the increased capacity of PCM reduces average memory access time because of fewer page faults. For db1, even though the increased capacity of PCM reduces page faults by 46%, the average memory access time increases by 59% because the working set is frequently reused. Both daxpy and vdotp do not benefit from more capacity but suffer from the increased memory latency. Having a 1GB DRAM buffer along with PCM makes the average memory access time much closer to that of the expensive 32GB DRAM system.

[Figure 9: Average cycles per memory access (normalized to the baseline 8GB DRAM system) for the 8GB DRAM, 32GB PCM, 32GB DRAM, and 1GB DRAM + 32GB PCM systems.]

5.3 Normalized Execution Time

Figure 10 shows the normalized execution time of the four systems discussed in the previous section. The reduction in average memory access time correlates well with the reduction in execution time. On average, relative to the baseline 8GB system, the PCM-only 32GB system reduces execution time by 53% (speedup of 2.12X) and the 32GB DRAM-only system reduces it by 70% (speedup of 3.3X), whereas the hybrid configuration reduces it by 66.6% (speedup of 3X). For five of the eight benchmarks, execution time reduces by more than 50% with the hybrid memory system. Thus, the hybrid configuration provides a performance benefit similar to increasing the memory capacity by 4X using DRAM, while incurring only about 13% area overhead, whereas the DRAM-only system would require 4X the area.

[Figure 10: Execution time (normalized to 8GB DRAM) for the 8GB DRAM, 32GB PCM, 32GB DRAM, and 1GB DRAM + 32GB PCM systems.]

5.4 Impact of Hard-Disk Optimizations

In our baseline we used an aggressive Flash-based disk cache with a 99% hit rate. Figure 11 shows the speedup obtained with the hybrid memory (1GB DRAM + 32GB PCM) over 8GB DRAM for four systems: a Flash-only disk system, a Flash cache with 99% hit rate, a Flash cache with 90% hit rate, and a traditional system with no Flash-based disk cache. As the benchmark bsearch has very large speedups, the geometric mean without bsearch, GmeanNoB, is also shown. For current systems with a limited-size Flash-based disk cache (or without one), the speedups are much higher than reported with our aggressive baseline. For the Flash-only disk system, the hybrid memory system provides an average speedup of 2.5X (2X excluding bsearch). For a limited-size disk cache with a 90% hit rate, the hybrid memory system provides an average speedup of 4.46X (3.27X excluding bsearch). For a traditional disk-only system, the hybrid memory provides an average speedup of 5.3X (3.7X excluding bsearch). Thus, regardless of the HDD optimizations, hybrid main memory provides significant speedups.

[Figure 11: Speedup of hybrid memory over the baseline as the Flash cache is varied (Flash only with no HDD, Flash cache with 99% hit rate, Flash cache with 90% hit rate, and no Flash with HDD only), shown per benchmark with Gmean and GmeanNoB. Note: the Y-axis is in log scale.]


5.5 Impact of PCM Latency

We have assumed that PCM has 4X higher read latency than DRAM. In this section, we analyze the benefit of the hybrid memory system as the read latency of PCM is varied. Figure 12 shows the average execution time of the eight workloads, normalized to the 8GB DRAM system, for the PCM-only system and the hybrid memory system as PCM read latency is increased from 2X to 16X. For reference, the average execution time for the 32GB DRAM-only system is also shown. As the read latency of PCM is increased from 2X to 16X, the average execution time of the PCM-only system increases from 0.35 to 1.09. Thus, the higher read latency of PCM hurts performance even with a 4X boost in capacity. The PCM-based hybrid system, however, is much more robust to PCM read latency: as the read latency of PCM is increased from 2X to 16X, the average execution time of the hybrid system increases from 0.33 to only 0.38.

[Figure 12: Normalized execution time (averaged over the workloads) of the PCM-only and hybrid systems as PCM read latency is varied from 2X to 16X, with the 8GB DRAM and 32GB DRAM systems shown for reference. PCM latency is defined relative to DRAM latency (1X).]

5.6 Impact of DRAM-Buffer Size

The DRAM buffer reduces the effective memory access time of the PCM-based memory system relative to an equal-capacity DRAM memory system. A larger DRAM buffer reduces the latency difference between the two memory systems, but incurs more area overhead. Figure 13 shows the execution time of the hybrid system with 32GB PCM as the DRAM buffer size is varied from 0.25GB to 1GB, 2GB, and 8GB. Note that the execution time is normalized to the 32GB DRAM-only system and not the baseline. Except for bsearch, all benchmarks have execution time with a 1GB DRAM buffer very close to that of the 32GB DRAM system. Thus, a 1GB DRAM buffer provides a good trade-off between performance benefit and area overhead.

[Figure 13: Execution time of hybrid memory (with 32GB PCM) as the size of the DRAM buffer is varied from 0.25GB to 8GB. All values are normalized to the 32GB DRAM system.]

5.7 Storage Overhead of Hybrid Memory

Table 4 shows the storage overhead for the hybrid memory consisting of 32GB PCM and a 1GB DRAM buffer. The significant additional area overhead is the DRAM buffer. In addition to data, each entry in the DRAM buffer requires 31 bits of tag store (1 valid bit + 16 dirty bits + 9-bit tag + 4-bit LRU + P bit). We assume that the tags of the DRAM buffer are made of SRAM in order to have fast latency. Relative to the 8GB of DRAM main memory of the baseline, the total overhead is still less than 13% (we pessimistically assume SRAM cells incur 20 times the area of DRAM cells [2]).

Table 4: Storage Overhead wrt 8GB DRAM Memory

Structure                 | Storage overhead
1GB DRAM Buffer Data      | 1GB DRAM
1GB DRAM Buffer Tag       | 1MB SRAM (4 bytes/page * 256K pages)
100-entry PCM write queue | 400KB SRAM
Wear-leveling counters    | 4MB DRAM (4 bits/page * 8M pages)
Total storage overhead    | 1.005 GB (12.9%)
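As a back-of-the-envelope check (our own arithmetic based on Table 4 and the stated 20x SRAM-to-DRAM area assumption; the weighting of the SRAM structures is our reading of the text), the DRAM-equivalent area of the extra structures works out to roughly 13% of the 8GB baseline:

```python
# Area overhead check: weight SRAM structures by ~20x the area of DRAM cells.
SRAM_AREA_FACTOR = 20                      # SRAM cell ~ 20x DRAM cell area [2]

dram_buffer_mb   = 1024                    # 1GB DRAM buffer (data)
buffer_tag_mb    = 1 * SRAM_AREA_FACTOR    # 1MB SRAM tag store (4B x 256K pages)
write_queue_mb   = 0.4 * SRAM_AREA_FACTOR  # 100-entry x 4KB write queue (SRAM)
wear_counters_mb = 4                       # 4 bits x 8M PCM pages, kept in DRAM

total_mb = dram_buffer_mb + buffer_tag_mb + write_queue_mb + wear_counters_mb
print(f"{total_mb / (8 * 1024):.1%}")      # ~12.9% of the 8GB DRAM baseline
```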

5.8 Power and Energy Implications

Figure 14 shows three key metrics (power, energy, and energy-delay product) for the memory system of the baseline (8GB DRAM), the hybrid memory (1GB DRAM + 32GB PCM), and the 32GB DRAM-only memory. All values are reported normalized to the baseline and are computed as the geometric mean over the 8 benchmarks. For the DRAM storage we assumed DDR3-type memories and used the Micron System Power Calculator [21]. For the PCM system we used values of idle power, read power, and write power of PCM relative to DRAM, based on the PCM prototypes [14].

[Figure 14: Power, energy, and energy-delay product of the memory systems, normalized to the baseline: (a) baseline 8GB DRAM, (b) 1GB DRAM + 32GB PCM, (c) 32GB DRAM, with the PCM and DRAM contributions shown separately.]

The hybrid memory system consumes 33% more power than the baseline, but most (80%) of that power is consumed by the DRAM buffer. In comparison, the system with 32GB DRAM consumes twice the power. The power increase for both of these systems comes from the reduced overall execution time relative to the baseline. For the hybrid memory, PCM consumes only 20% of the power. This is because the DRAM buffer filters out most of the PCM accesses, and our proposed mechanisms further reduce write accesses to PCM.

Both the hybrid memory and the 32GB DRAM-only system consume much less energy than the baseline. On average, the hybrid system consumes 45% of the baseline energy, compared to 59% for the 32GB DRAM system. These results confirm that the PCM-based hybrid memory is a practical, power-performance-efficient architecture for increasing memory capacity.


6. IMPACT OF WRITE ENDURANCE

PCM has significantly less write endurance than DRAM. PCM storage in a hybrid memory system receives write traffic when a newly fetched page from disk is installed in PCM, or when a page already present in PCM gets writeback updates from the DRAM buffer. To analyze the impact of write traffic on the lifetime of PCM, we first develop an analytical model.

For a processor running at F Hz that operates for Y years, the PCM main memory must last for Y · F · 2^25 processor cycles, given that there are approximately 2^25 seconds in a year. Let the PCM be of size S bytes, written at a rate of B bytes per cycle, and let Wmax be the maximum number of writes that can be done to any PCM cell. Assuming writes can be made uniform over the entire PCM capacity, we have

(S / B) · Wmax = Y · F · 2^25        (1)

Wmax = (Y · F · B / S) · 2^25        (2)

Thus, a 4GHz system with S = 32GB written at 1 byte/cycle must have Wmax ≥ 2^24 to last 4 years. Table 5 shows the average bytes per cycle (BPC) written to the 32GB PCM storage. The PCM-only system has a BPC of 0.317 (average lifetime of 7.6 years), but that is primarily because the system takes much more time than DRAM to execute the same program. Therefore, the lifetime of the PCM-only system is not useful, and in fact misleading. The more practical PCM memory system is the hybrid with 1GB DRAM, which has an average BPC of 0.807 (lifetime of 3 years). To improve PCM lifetime we proposed three² techniques to reduce the write traffic from DRAM to PCM. First, Lazy Write (Section 3.2), which avoids the first write to PCM for dirty pages. Second, Line Level WriteBack (LLWB) (Section 3.3), which tracks which lines in a page are dirty and writes only those lines from DRAM to PCM when the DRAM page is evicted. Third, Page Level Bypass (PLB) (Section 3.5) for applications that have poor reuse in PCM. We enable PLB only for daxpy and vdotp, for which PLB completely eliminates writes to PCM.

Table 5 also shows the write traffic and average lifetime as each of the three optimizations is added in turn to the hybrid memory configuration. Lazy Write reduces BPC from 0.807 to 0.725 (lifetime of 3.4 years). LLWB reduces BPC to 0.316 (average lifetime of 7.6 years), and enabling PLB on top of LLWB increases the lifetime to 9.7 years. Thus, the design choices and architecture of main memory significantly affect the lifetime of PCM.

² The Fine-Grained Wear-Leveling (FGWL) technique described in Section 3.4 does not reduce write traffic. It ensures uniform wear-out of lines within a PCM page so that we can use the average write traffic in calculating the lifetime (assuming writes across different pages are made uniform using some page-level wear-leveling algorithm).
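For concreteness, the short calculation below applies Equation (1) to the lifetimes quoted above. It is our own arithmetic, and it treats 32GB as 32 × 10^9 bytes, which reproduces the reported numbers to within rounding.

```python
# Lifetime in years implied by a write rate of B bytes/cycle (Equation 1),
# for a 4 GHz processor, 32GB of PCM, and Wmax = 10^7 writes per cell,
# assuming writes are spread uniformly over the whole PCM capacity.
def lifetime_years(bytes_per_cycle, w_max=1e7, size_bytes=32e9, freq_hz=4e9):
    seconds_per_year = 2 ** 25              # approximation used in the text
    lifetime_cycles = (size_bytes / bytes_per_cycle) * w_max   # per Eq. (1)
    return lifetime_cycles / (freq_hz * seconds_per_year)

print(round(lifetime_years(0.807), 1))      # ~3.0 years (hybrid, no optimizations)
print(round(lifetime_years(0.247), 1))      # ~9.7 years (Lazy Write + LLWB + PLB)
```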

7. RELATED WORK

A NAND Flash-based file buffer cache is used in [13] to reduce main memory power. However, Flash is still about 2^8 times slower than DRAM, implying that increasing DRAM capacity is still important to reduce accesses to the Flash-based disk cache. Our work is complementary to [13], and in fact in our evaluations we assume that the HDD is supported by an aggressive Flash cache. Furthermore, [13] uses improved error correction for Flash memories, which does not increase the lifetime of each individual cell, whereas we focus on reducing the write traffic to PCM so that each cell wears out after a longer time. Both approaches can be used together for greater benefit than either scheme standalone.

Challenges and issues of using SRAMs for main memory are discussed in [17]. The first level of main memory is an SRAM, followed by the traditional DRAM memory. The goal of that study was to improve memory speed, not to address the cost or power limits of the DRAM memory; the SRAM memory is in fact less dense, and consumes more power, than the DRAM memory. However, the challenges highlighted in their work, namely efficient address translation for a two-level memory organization, page size optimizations for the SRAM and DRAM, and changes in the OS to accommodate two-level memories, are still valid for our hybrid DRAM and PCM memory system.

In related work [7, 8], a multi-level main memory organization is evaluated, showing that allocating only 30% of the memory to be accessed at DRAM speed, and the rest at a slower rate, does not significantly impact performance, and allows the use of techniques such as memory compression and dynamic power management of different regions of memory. Their work still uses DRAM as the memory for the multiple levels of the hierarchy, and does not use hybrid memory technologies. However, the key finding in [7] that it is possible to tolerate a longer access latency to a major part of the main memory inspired our work on using hybrid memory technologies to increase the overall main memory capacity.

A virtualization-based methodology to study hybrid memories is presented in [28]. However, that work focuses only on methodology, whereas we focus on the management and organization of hybrid memories to obtain larger capacity while improving overall system cost, power, and lifetime.

Mangalagiri et al. [18] propose using PCM-based on-chip caches to reduce power. However, we believe that the slower latency and limited endurance of PCM make it useful only after DRAM, and not as on-chip caches.

For increasing DRAM capacity, MetaRAM [20] uses the MetaSDRAM chipset between the memory controller and the DRAM, which allows up to 4X more mainstream DRAM to be integrated into existing DIMMs without any changes to hardware. However, this still does not address the growth in the cost or power of main memory. It only allows users who are willing to pay a higher cost for DRAM to have higher capacity within the same motherboard space.

Table 5: Average number of bytes per cycle written to PCM (and PCM lifetime if Wmax = 10^7)

Configuration | db1   | db2   | qsort | bsearch | kmeans | gauss | daxpy | vdotp | Average | Average Lifetime
PCM 32GB      | 0.161 | 0.186 | 0.947 | 0.127   | 0.138  | 0.305 | 0.400 | 0.269 | 0.317   | 7.6 yrs
+ 1GB DRAM    | 1.909 | 1.907 | 1.119 | 0.150   | 0.190  | 0.468 | 0.422 | 0.288 | 0.807   | 3.0 yrs
+ Lazy Write  | 1.854 | 1.866 | 1.004 | 0.078   | 0.096  | 0.354 | 0.274 | 0.276 | 0.725   | 3.4 yrs
+ LLWB        | 0.324 | 0.327 | 0.797 | 0.078   | 0.096  | 0.354 | 0.274 | 0.276 | 0.316   | 7.6 yrs
+ PLB         | 0.324 | 0.327 | 0.797 | 0.078   | 0.096  | 0.354 | 0     | 0     | 0.247   | 9.7 yrs


8. SUMMARY

The need for memory capacity continues to increase, while main memory systems consisting of DRAM have started hitting the cost and power wall. An emerging memory technology, Phase Change Memory (PCM), promises much higher density than DRAM and can increase main memory capacity substantially. However, PCM comes with the drawbacks of increased access latency and a limited number of writes. In this paper we studied the impact of using PCM as main memory and make the following contributions:

1. To our knowledge, this is the first architectural study that proposes and evaluates PCM for main memory systems. We provide an abstract model for PCM and show that for currently projected values, PCM can increase main memory capacity by 4X and reduce page faults by 5X on average.

2. As PCM is slower than DRAM, we recommend that it be used in conjunction with a DRAM buffer. Our evaluations show that a small DRAM buffer (3% the size of the PCM storage) can bridge the latency gap between PCM and DRAM. We also show that for a wide variety of workloads the PCM-based hybrid system provides an average speedup of 3X while requiring only 13% area overhead.

3. We propose three techniques, Lazy Write, Line Level WriteBack, and Page Level Bypass, to reduce the write traffic to PCM. We show that these simple techniques can reduce the write traffic by 3X and increase the average lifetime of PCM from 3 years to 9.7 years.

4. We propose Fine-Grained Wear-Leveling (FGWL), a simple and low-overhead technique to make the wear-out of PCM storage uniform across all lines in a page. FGWL requires 4 bits of storage per 4KB page.

We believe the architecture-level abstractions used in this work will serve as a starting point for system architects to address the challenges posed by PCM, which will make PCM attractive for the main memory of future systems.

Acknowledgments

The authors thank Ravi Nair, Robert Montoye, Partha Ranganathan, and the anonymous reviewers for their comments and feedback. We also thank Pradip Bose for his support during this work.

9. REFERENCES

[1] The Basics of Phase Change Memory Technology. http://www.numonyx.com/Documents/WhitePapers/PCM_Basics_WP.pdf.
[2] International Technology Roadmap for Semiconductors, ITRS 2007.
[3] Apple Computer Inc. Apple Products. http://www.apple.com.
[4] F. Bedeschi et al. A multi-level-cell bipolar-selected phase-change memory. In 2008 IEEE International Solid-State Circuits Conference, pages 428–430, Feb. 2008.
[5] E. Doller. Flash Memory Trends and Technologies. Intel Developer Forum, 2006.
[6] E. Grochowski and R. Halem. Technological impact of magnetic hard disk drives on storage systems. IBM Systems Journal, 42(2):338–346, 2003.
[7] M. Ekman and P. Stenstrom. A case for multi-level main memory. In WMPI '04: Proceedings of the 3rd Workshop on Memory Performance Issues, pages 1–8, 2004.
[8] M. Ekman and P. Stenstrom. A cost-effective main memory organization for future servers. In IPDPS '05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, 2005.
[9] R. Freitas and W. Wilcke. Storage-class memory: The next storage system technology. IBM Journal of R. and D., 52(4/5):439–447, 2008.
[10] HP. Memory technology evolution: an overview of system memory technologies. Technology brief, 7th edition, 1999.
[11] C.-G. Hwang. Semiconductor memories for IT era. In 2002 IEEE International Solid-State Circuits Conference, pages 24–27, Feb. 2002.
[12] J. Javanifard et al. A 45nm Self-Aligned-Contact Process 1Gb NOR Flash with 5MB/s Program Speed. In 2008 IEEE International Solid-State Circuits Conference, pages 424–426, Feb. 2008.
[13] T. Kgil, D. Roberts, and T. Mudge. Improving NAND Flash Based Disk Caches. In ISCA '08: Proceedings of the 35th Annual International Symposium on Computer Architecture, pages 327–338, 2008.
[14] K. J. Lee et al. A 90nm 1.8V 512Mb Diode-Switch PRAM with 266 MB/s Read Throughput. IEEE Journal of Solid-State Circuits, 43(1):150–162, 2008.
[15] C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and T. W. Keller. Energy management for commercial servers. IEEE Computer, 36(12):39–48, Dec. 2003.
[16] M-Systems. TrueFFS Wear-leveling Mechanism. http://www.dataio.com/pdf/NAND/MSystems/TrueFFS_Wear_Leveling_Mechanism.pdf.
[17] P. Machanick. The Case for SRAM Main Memory. Computer Architecture News, 24(5):23–30, 1996.
[18] P. Mangalagiri, K. Sarpatwari, A. Yanamandra, V. Narayanan, Y. Xie, M. J. Irwin, and O. A. Karim. A low-power phase change memory based hybrid cache architecture. In Proceedings of the 18th ACM Great Lakes Symposium on VLSI, pages 395–398, 2008.
[19] J. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, Dec. 1995.
[20] MetaRAM, Inc. MetaRAM. http://www.metaram.com/.
[21] Micron. Micron System Power Calculator. http://www.micron.com/support/part_info/powercalc.
[22] D. Nobunaga et al. A 50nm 8Gb NAND Flash Memory with 100MB/s Program Throughput and 200MB/s DDR Interface. In 2008 IEEE International Solid-State Circuits Conference, pages 426–427, Feb. 2008.
[23] S. R. Ovshinsky. Reversible electrical switching phenomena in disordered structures. Phys. Rev. Lett., 21(20), 1968.
[24] S. Raoux, G. W. Burr, M. J. Breitwisch, C. T. Rettner, Y.-C. Chen, R. M. Shelby, M. Salinga, D. Krebs, S.-H. Chen, H.-L. Lung, and C. H. Lam. Phase-change random access memory: A scalable technology. IBM Journal of R. and D., 52(4/5):465–479, 2008.
[25] J. Tominaga, T. Kikukawa, M. Takahashi, and R. T. Phillips. Structure of the Optical Phase Change Memory Alloy, AgVInSbTe, Determined by Optical Spectroscopy and Electron Diffraction. J. Appl. Phys., 82(7), 1997.
[26] C. Weissenberg. Current & Future Main Memory Technology for IA Platforms. Intel Developer Forum, 2006.
[27] N. Yamada, E. Ohno, K. Nishiuchi, and N. Akahira. Rapid-Phase Transitions of GeTe-Sb2Te3 Pseudobinary Amorphous Thin Films for an Optical Disk Memory. J. Appl. Phys., 69(5), 1991.
[28] D. Ye, A. Pavuluri, C. Waldspurger, B. Tsang, B. Rychlik, and S. Woo. Prototyping a hybrid main memory using a virtual machine monitor. In Proceedings of the IEEE International Conference on Computer Design, 2008.
