EnVM: Virtual Memory Design for New Memory Architectures

Pooja ROY, Manmohan MANOHARAN, Weng Fai WONG

National University of Singapore

ESWEEK (CASES) October 2014

International Conference on Compilers, Architectures and Synthesis of Embedded Systems

New Memory Architectures

• NVMs (STT-RAM, MRAM, etc.)
  – Energy efficient
  – Higher density
  – High write latency (3x slower than reads)
  – Low write endurance

• Solution: hybrid memories (NVM + SRAM/DRAM)


Hybrid Caches

• SRAM + STT-RAM hybrid design
• Data allocation
  – Reducing writes to NVM partition
  – Redirecting write intensive data to SRAM partition

• Performance Impact
  – Data movement between partitions is expensive
• Energy Impact
  – High writes to NVM might offset energy savings

Motivation

• Different solutions (previous works) for each level of memory
  – Not co-operative. Conflicting.
  – Not holistic for hybrid memory hierarchy


Motivation

♯ Stack data layout for hybrid L1 cache (Li et al., ISLPED'12)

♯ Reuse distance based data allocation for hybrid L2 cache (Chen et al., LCTES'12)

[Figure: two stack layouts (variables a, b, c, d) shown next to data items x1, x2, x3, x4]

Write reuse sequence: <x1, x2, x4, x1, x4, x1, x3>

Read reuse sequence: <x2, x2, x4, x3, x2, x3, x1>

Write intensive: <x1, x4>

Read intensive: <x2, x3>
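The write intensive and read intensive sets on this slide can be reproduced with a simple frequency count over the two sequences. The sketch below is only that: a count-based stand-in, not the reuse-distance method of the cited work. The function name `split_by_intensity` and the "more writes than reads" rule are assumptions made for illustration.

```python
from collections import Counter

# Sequences copied from the slide.
write_seq = ["x1", "x2", "x4", "x1", "x4", "x1", "x3"]
read_seq = ["x2", "x2", "x4", "x3", "x2", "x3", "x1"]


def split_by_intensity(write_seq, read_seq):
    """Split data items into (write_intensive, read_intensive) lists.

    Assumed rule: an item is write intensive when it appears more often
    in the write sequence than in the read sequence.
    """
    writes, reads = Counter(write_seq), Counter(read_seq)
    write_intensive, read_intensive = [], []
    for item in sorted(set(writes) | set(reads)):
        if writes[item] > reads[item]:
            write_intensive.append(item)
        else:
            read_intensive.append(item)
    return write_intensive, read_intensive


write_intensive, read_intensive = split_by_intensity(write_seq, read_seq)
```

With the slide's sequences, x1 and x4 are written more often than they are read, while x2 and x3 are read more often than they are written.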


Motivation

• Hardware solutions: heavy modifications and energy overheads

• Software solutions: partial support or profile-based techniques


Our Approach - EnVM

• Uses virtual memory to serve every level of the hybrid memory hierarchy

• Handles static and dynamic data, no profiling required

• Utilizes existing hardware

• Advocates migration-less cache design


EnVM


Static Analysis

[Figure: control-flow graph with blocks b1, b2, b3, b4]

Code in the graph:
  a = p;   b = q
  a = a-5; b = b*2
  c = p+q; d = p-q

Annotations, as (variable, read count, write count):
  a (0,1), p (1,0), b (0,1), q (1,0)
  a (1,2), b (1,2)
  c (0,1), d (0,1), p (2,0), q (2,0), p (3,0), q (3,0)

• Abstract interpretation based dataflow analysis
• Heuristic estimate of memory access intensity
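The (variable, read count, write count) annotations can be reproduced by counting accesses over the six assignments on the slide. The sketch below is a deliberately simplified stand-in: it treats the statements as straight-line code and ignores control flow, so it is not the abstract-interpretation analysis itself. The name `count_accesses` and the regular-expression parsing are assumptions for illustration.

```python
import re
from collections import defaultdict

# Assignments copied from the slide, in the order they appear.
STATEMENTS = ["a = p", "b = q", "a = a-5", "b = b*2", "c = p+q", "d = p-q"]


def count_accesses(statements):
    """Return {variable: (read_count, write_count)} for simple assignments."""
    counts = defaultdict(lambda: [0, 0])
    for stmt in statements:
        lhs, rhs = (side.strip() for side in stmt.split("=", 1))
        counts[lhs][1] += 1  # the left-hand side is written once
        # Every identifier on the right-hand side is read once.
        for name in re.findall(r"[A-Za-z_]\w*", rhs):
            counts[name][0] += 1
    return {var: tuple(pair) for var, pair in counts.items()}


access_counts = count_accesses(STATEMENTS)
```

The totals match the final annotations in the figure: a and b end at (1,2), c and d at (0,1), and p and q at (3,0).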


Static Analysis

[Figure: variables c (0,1), d (0,1), b (1,2), a (1,2), q (3,0), p (3,0) grouped as read intensive (STT-RAM) or write intensive (SRAM)]

• Clustering based on unsupervised machine learning algorithm
• Classification into 4 classes and then into 2 partitions
• Read intensive allocated to STT-RAM partition
• Write intensive allocated to SRAM partition

Classes
• Low Read – Low Write
• Low Read – High Write
• High Read – Low Write
• High Read – High Write
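A rough sketch of the two-step mapping (4 classes, then 2 partitions) follows. The slides do not name the clustering algorithm, so this example swaps in a plain mean threshold on each axis. The tie-break for the Low-Low and High-High classes (SRAM when writes are at least as frequent as reads) is likewise an assumption, as is the function name `classify`.

```python
from statistics import mean

# (read count, write count) per variable, taken from the slide.
COUNTS = {"a": (1, 2), "b": (1, 2), "c": (0, 1), "d": (0, 1),
          "p": (3, 0), "q": (3, 0)}


def classify(counts):
    """Map each variable to (class label, partition).

    Assumption: "High" means strictly above the mean count on that axis.
    This is a stand-in for the unsupervised clustering in the slides.
    """
    read_cut = mean(r for r, _ in counts.values())
    write_cut = mean(w for _, w in counts.values())
    result = {}
    for var, (reads, writes) in counts.items():
        high_read, high_write = reads > read_cut, writes > write_cut
        label = (f"{'High' if high_read else 'Low'} Read - "
                 f"{'High' if high_write else 'Low'} Write")
        if high_read and not high_write:
            partition = "STT-RAM"  # read intensive
        elif high_write and not high_read:
            partition = "SRAM"     # write intensive
        else:
            # Assumed tie-break for the two ambiguous classes.
            partition = "SRAM" if writes >= reads else "STT-RAM"
        result[var] = (label, partition)
    return result


placement = classify(COUNTS)
```

Under these assumptions p and q land in the STT-RAM partition and a, b, c, d in the SRAM partition. A real clustering step could draw the class boundaries differently.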


Memory Access Types

[Figure: stacked bars, one per SPEC CPU2006 benchmark (400.perlbench through 483.xalancbmk). X-axis: Benchmarks. Y-axis: Number of Variables (%), 0% to 100%. Legend: Low Read - Low Write, Low Read - High Write, High Read - Low Write, High Read - High Write]

Variables show high read OR write affinity


Dynamic Memory

• Hard to analyze

• Exposed to programmer

• Dynamic memory library support
  – Enable dual heap structure

• Two distinct system calls (r_malloc, w_malloc)
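The dual heap idea can be modelled with two separate address ranges, one behind each call. The sketch below is a toy model under several assumptions: the base addresses, heap size, bump-allocation policy, and the `partition_of` helper are all hypothetical, and it assumes r_malloc serves read intensive data (STT-RAM) while w_malloc serves write intensive data (SRAM).

```python
class BumpHeap:
    """A toy heap that hands out addresses from one contiguous range."""

    def __init__(self, base, size):
        self.base = base
        self.limit = base + size
        self.next_free = base

    def alloc(self, nbytes):
        aligned = (nbytes + 7) & ~7  # round the request up to 8 bytes
        if self.next_free + aligned > self.limit:
            raise MemoryError("heap exhausted")
        addr = self.next_free
        self.next_free += aligned
        return addr

    def contains(self, addr):
        return self.base <= addr < self.limit


class DualHeap:
    """Two heaps in disjoint (hypothetical) virtual address ranges."""

    def __init__(self, read_base=0x10000000, write_base=0x20000000,
                 size=1 << 20):
        self.read_heap = BumpHeap(read_base, size)
        self.write_heap = BumpHeap(write_base, size)

    def r_malloc(self, nbytes):
        """Allocate data the programmer expects to be read intensive."""
        return self.read_heap.alloc(nbytes)

    def w_malloc(self, nbytes):
        """Allocate data the programmer expects to be write intensive."""
        return self.write_heap.alloc(nbytes)

    def partition_of(self, addr):
        """Name the cache partition implied by the address range."""
        if self.read_heap.contains(addr):
            return "STT-RAM"
        if self.write_heap.contains(addr):
            return "SRAM"
        return None


heap = DualHeap()
table = heap.r_malloc(100)  # e.g. a lookup table that is mostly read
log = heap.w_malloc(10)     # e.g. a buffer that is updated often
```

Because the two heaps never overlap, the target partition follows from the address alone, which is the property the slides rely on to avoid profiling dynamic data.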


EnVM Layout

[Figure: Existing virtual memory layout]

[Figure: Proposed virtual memory layout]

• x86 segment registers do boundary checking

• Minimum modification to fit other architectures

• Allocating the data from each segment to either STT-RAM or SRAM


Evaluation

• Comparison
  – Hardware only method (HW) on hybrid L1
  – Software method based on stack layout (SW1) on hybrid L1
  – Software method based on reuse distance (SW2) on hybrid L2
  – Our method on hybrid L1 (EnVM)

MARSSx86 Cycle Accurate Simulator

Processor: Unicore, 3GHz, Commit Width - 4

Memory - Hybrid L1 Design
  L1 I-Cache (SRAM):    64K, 64B Line, 3 Cycles
  L1 D-Cache (Hybrid):  SRAM: 4K, 4-way, 3 Cycles
                        STT-RAM: 64K, 4-way, Read - 3 Cycles, Write - 10 Cycles
  L2 (SRAM):            2M, 8-way, 15 Cycles, 64B Lines

Memory - Hybrid L2 Design
  L1 I-Cache (SRAM):    64K, 8-way, 3 Cycles, 64B Line
  L1 D-Cache (SRAM):    32K, 8-way, 3 Cycles, 64B Line
  L2 (Hybrid):          SRAM: 1M, 4-way, 3 Cycles
                        STT-RAM: 2M, 8-way, Read - 11 Cycles, Write - 30 Cycles


Write Reduction

• Normalized to HW
• Reduces writes by 47.6% vs. HW and 15% vs. SW1

[Figure: Normalized No. of Writes to STT-RAM, one bar group per SPEC CPU2006 benchmark (integer benchmarks with INT AVG, floating-point benchmarks with FP AVG, then AVERAGE). X-axis: Benchmarks. Y-axis: 0 to 1. Series: SW2, EnVM, SW1]
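Since the chart is normalized to HW, a bar height and a percentage reduction are two views of the same number. The small sketch below shows the conversion. The helper names are hypothetical, and it assumes the quoted percentages are derived from the average normalized values.

```python
def percent_reduction(normalized_value):
    """Percent reduction implied by a value normalized to the baseline."""
    return round((1.0 - normalized_value) * 100.0, 1)


def normalized_from_reduction(percent):
    """Normalized value implied by a percent reduction over the baseline."""
    return round(1.0 - percent / 100.0, 3)


# A 47.6% reduction over HW corresponds to a normalized write count
# of about 0.524 on the chart's y-axis.
hw_normalized = normalized_from_reduction(47.6)
```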


Energy Savings

[Figure: Normalized Energy Per Instruction, one bar group per SPEC CPU2006 benchmark (integer benchmarks with INT AVG, floating-point benchmarks with FP AVG, then TOTAL AVG). X-axis: Benchmarks. Y-axis: 0 to 1.4. Series: SW2, EnVM, SW1, HW]

• Normalized to pure SRAM configuration
• Max. energy reduction 50% for 458.sjeng
• Reduces energy by 21% vs. HW and 6% vs. SW1


Performance Impact

[Figure: Normalized IPC, one bar group per SPEC CPU2006 benchmark (integer benchmarks with INT AVG, floating-point benchmarks with FP AVG, then AVERAGE). X-axis: Benchmarks. Y-axis: 0.3 to 1.2. Series: HW, SW1, SW2, EnVM]

• Normalized to pure SRAM configuration
• Comparable IPC
• Write latency is offset by bigger cache capacities


Summary

• Holistic management of process memory to aid hybrid memory hierarchy

• Reduces writes - 47.6% (HW) & 15% (SW1)

• Reduces energy - 21% (HW) & 6% (SW1)
• Minimum hardware modification
• No profiling of applications
• No migration of data
• Improvements
  – Dynamic memory management


Thank You
