Scalable High Performance Main Memory System Using PCM Technology

© 2007 IBM CorporationInternational Symposium on Computer Architecture (ISCA-2009) Apr 19, 2023

Scalable High Performance Main Memory System Using PCM Technology

Moinuddin K. QureshiViji Srinivasan and Jude Rivers

IBM T. J. Watson Research Center, Yorktown Heights, NY

2 © 2007 IBM Corporation

Main Memory Capacity Wall

More cores in system More concurrency Larger working set

Demand for main memory capacity continues to increase

Main Memory System consisting of DRAM are hitting:1. Cost wall: Major % of cost of large servers is main memory2. Scaling wall: DRAM scaling to small technology is challenge3. Power wall:

IBM P670 Server Processor Memory

Small (4 proc, 16GB) 384 Watts 314 Watts

Large (16 proc, 128GB) 840 Watts 1223 Watts

Source: Lefurgy et al. IEEE Computer 2003

Need a practical solution to increase main-memory capacity


The Technology Hierarchy

More capacity by cheaper, denser, (slower) technology

21 23 27 211 213 215 219 223

Typical access latency in processor cycles (@ 4 GHz)

L1(SRAM) EDRAM DRAM HDD

25 29 217 221

Flash PCM

Phase Change Memory (PCM) promising candidate for large capacity main memory

High-Performance DiskMemory System


Outline

Introduction What is PCM ? Hybrid Memory System Evaluation Lifetime Analysis Summary


What is Phase Change Memory?

Phase change material (chalcogenide glass) exists in two states:1. Amorphous: high resistivity2. Crystalline: low resistivity

Materials can be switched between states reliably, quickly, large number of times

PCM stores data in terms of resistance• Low resistance (SET state) = 1• High resistance (RESET state) = 0

I

Bit Line

WordLine

N

WordLine

N N


How does PCM work ?

Tmelt

Tcryst

Time [ns]

RESET

SET

Tem

per

atu

re

Switching by heating using electrical pulses

SET: sustained current to heat cell above Tcryst

RESET: cell heated above Tmelt and quenched

LargeCurrent

SETLow resistance

103-104

SmallCurrent

RESETHigh resistance

AccessDevice

MemoryElement

106-107

Photo Courtesy: Bipin Rajendran, IBM


Key Characteristics of PCM

+ Scales better than DRAM, small cell size Prototypes as small as 3nm x 20 nm fabricated and tested [Raoux+ IBMJRD’08]

+ Can store multiple bits/cell More density in the same area Prototypes with 2 bits/cell in ISSCC’08. >2 bits/cell expected soon.

+ Non-Volatile Memory Technology Data retention of 10 years Power implications, system implications

Challenges: - More latency compared to DRAM. - Limited Endurance (~10 million writes per cell)- Write bandwidth constrained, so better to write less often.


Outline



Hybrid Memory System

Hybrid Memory System: 1. DRAM as cache to tolerate PCM Rd/Wr latency and Wr bandwidth2. PCM as main-memory to provide large capacity at good cost/power

DATA W

PCM Main Memory

DATAT

DRAM Buffer

PCM Write Queue

T=Tag-Store

Processor

FlashOr

HDD


Lazy Write Architecture

Problem: Double PCM writes to dirty pages on install

DRAM Buffer PCM

Flash/Disk

WRQ

Processor

For example: Daxpy Kernel: Y[i] = Y[i] + X[i]Baseline has 2 writes for Y[i] and 1 for X[i] Lazy write has 1 write for Y[i] and 1 for X[i]


0

2

4

6

8

10

12

14

16

18

20

db1 db2

Num

Writ

es P

er D

irty

Line

(Mill

ion)

Line Level Write Back

Problem: Not all lines in a dirty page are dirty

Solution: Dirty bits per line in DRAM buffer and

write-back only dirty lines from DRAM to PCM

Problem: With LLWB, not all lines in dirty pages are written uniformly

Num

Writ

es t

o E

ach

Line

(M

ln)

db1 db20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

AverageAverage

Line_id


0

2

4

6

8

10

12

14

16

18

20

db1 db2

Num

Writ

es P

er D

irty

Line

(Mill

ion)

Fine Grained Wear Leveling

Solution: Fine Grained Wear Leveling (FGWL)-When a page gets allocated page is rotated by a random shift value-The rotate value remains constant while page remains in memory-On replacement of a page, a new random value is assigned for a new page -Over time, the write traffic per line becomes uniform.

FGWL makes writes across lines in a dirty page uniform

Num

Writ

es t

o E

ach

Line

(M

ln)

db1 db20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Average Average

Line_id


Outline



Evaluation Framework

Trace Driven Simulator:16-core system (simple core), 8GB DRAM main-memory at 320 cyclesHDD (2 ms) with Flash (32 us) with Flash hit-rate of 99%

Workloads:Database workloads & Data parallel kernels

1. Database workloads: db1 and db22. Unix utilities: qsort and binary search3. Data Mining : K-means and Gauss Seidal4. Streaming: DAXPY and Vector Dot Product

Assumption:PCM 4X denser & 4X slower than DRAM 32GB @ 1280 cycle read latency


Reduction in Page Faults

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

2.2

db1 db2 qsort bsearch kmeans gauss daxpy vdotp

Pag

e F

au

lts N

orm

alize

d t

o 8

GB

Syste

m

4GB8GB16GB32GB

Benefit from capacity Need >16GB Streaming


Impact on Execution Time

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1.1

db1 db2 qsort bsearch kmeans gauss daxpy vdotp gmean

No

rmal

ized

Exe

cuti

on

Tim

e

8GB DRAM32GB PCM32GB DRAM32GB PCM + 1GB DRAM

PCM with DRAM buffer performs similar to equal capacity DRAM storage


Impact of PCM Latency

Hybrid memory system is relatively insensitive to PCM Latency

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1.1

1X 2X 4X 8X 16X 2X 4X 8X 16X 1X

No

rma

lize

d E

xe

c.

Tim

e (

Av

g)

DRAM-8GB DRAM-32GBPCM-32GB HYBRID (1+32)GB


0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

2.2

Power Energy Energy x Delay

Va

lue

No

rma

lize

d t

o 8

GB

DR

AM 8GB DRAM

Hybrid (32GB PCM+ 1GB DRAM)

32GB DRAM

Power Evaluations

Significant Power and Energy savings with PCM based hybrid memory system


Outline



Impact of Write Endurance

F Frequency of System (4GHz)Y = Number of years (lifetime)

There are 225 seconds in a year

Num. cycles in Y years = Y. F.225

B Bytes/Cycle written to PCMS PCM capacity in bytesWmax Max writes per PCM cellAssuming uniform writes to PCM

Endurance (in cycles) = (S/B).Wmax

Y = (S/B). WmaxF.225

If Wmax = 10 million, PCM will last for 2.5 years

For a 4GHz System,a 32GB PCM written at

1 Byte per Cycle

Y = Wmax 4 million


Lifetime Results

Configuration Avg. Bytes/Cycle Avg. Lifetime

1GB DRAM + 32GB PCM 0.807 3.0 yrs

+ Lazy Write 0.725 3.4 yrs

+ Line Level Write Back 0.316 7.6 yrs

+ Bypass Streaming Apps 0.247 9.7 yrs

Table shows average bytes per cycle written to PCM and Average lifetime of PCM assuming Wmax = 10 million

Proposed filtering techniques reduce write traffic to PCM by 3.2X, increasing its lifetime from 3 to 9.7 years


Outline



Summary

Need more main memory capacity: DRAM hitting power, cost, scaling wall

PCM is an emerging technology – 4x denser than DRAM but with slower access time and limited write endurance

We propose a Hybrid Memory System (DRAM+PCM) that provides significant power and performance benefits

Proposed write filtering techniques reduce writes by 3x and increase PCM lifetime from 3 years to 9 years

Not touched in this talk but important: Exploiting non-volatile memories for system enhancement & related OS issues.


Thanks!

Documents

Scalable High Performance Main Memory System Using PCM Technology