Pooja ROY, Manmohan MANOHARAN, Weng Fai WONG
National University of Singapore
ESWEEK (CASES) October 2014
EnVM: Virtual Memory Design for New Memory Architectures
International Conference on Compilers, Architectures and Synthesis of Embedded Systems
New Memory Architectures
• NVMs (STT-RAM, MRAM, etc.)
– Energy efficient
– Higher density
– High write latency (3x slower than reads)
– Low write endurance
• Solution: hybrid memories (NVM + SRAM/DRAM)
Hybrid Caches
• SRAM + STT-RAM hybrid design
• Data allocation
– Reducing writes to NVM partition
– Redirecting write intensive data to SRAM partition
• Performance Impact
– Data movement between partitions is expensive
• Energy Impact
– High writes to NVM might offset energy savings
Motivation
• Different solutions (previous works) for each level of memory
– Not co-operative; conflicting
– Not holistic for the hybrid memory hierarchy
Motivation
♯ Stack data layout for hybrid L1 cache (Li et al., ISLPED'12)
[Figure: two stack layouts holding variables a, b, c, d and x1, x2, x3, x4]
♯ Reuse distance based data allocation for hybrid L2 cache (Chen et al., LCTES'12)
Write reuse sequence: <x1,x2,x4,x1,x4,x1,x3>
Read reuse sequence: <x2,x2,x4,x3,x2,x3,x1>
Write intensive: <x1,x4>
Read intensive: <x2,x3>
Motivation
• Hardware solutions: heavy modifications and energy overheads
• Software solutions: partial support or profile-based techniques
Our Approach - EnVM
• Uses virtual memory to serve every level of the hybrid memory hierarchy
• Handles static and dynamic data, no profiling required
• Utilizes existing hardware
• Advocates migration-less cache design
EnVM
Static Analysis
[Figure: control-flow graph with basic blocks b1, b2, b3, b4]
Statements in the blocks: a = p; b = q | a = a-5; b = b*2 | c = p+q; d = p-q
Annotations on the graph, as (variable, read count, write count):
a (0,1), p (1,0), b (0,1), q (1,0), c (0,1), d (0,1), b (1,2), a (1,2), p (2,0), q (2,0), q (3,0), p (3,0)
• Abstract interpretation based dataflow analysis
• Heuristic estimate of memory access intensity
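The counting idea behind the annotations can be sketched in a few lines of Python. This is only an illustration, not the abstract-interpretation analysis the slide names: it assumes every statement of the example runs exactly once and ignores control flow, and the helper name `count_accesses` is made up for this sketch.

```python
import re
from collections import defaultdict

# Statements copied from the slide's example blocks.
STATEMENTS = ["a = p", "b = q", "a = a-5", "b = b*2", "c = p+q", "d = p-q"]

# Matches variable names; numeric literals such as 5 or 2 are skipped.
IDENT = re.compile(r"[A-Za-z_]\w*")


def count_accesses(statements):
    """Return {variable: (read count, write count)} for simple assignments.

    Assumption: straight-line code, each statement executed once.
    """
    reads = defaultdict(int)
    writes = defaultdict(int)
    for stmt in statements:
        target, expr = (part.strip() for part in stmt.split("=", 1))
        for name in IDENT.findall(expr):
            reads[name] += 1   # every use on the right-hand side is a read
        writes[target] += 1    # the assignment target is written once
    names = sorted(set(reads) | set(writes))
    return {name: (reads[name], writes[name]) for name in names}


counts = count_accesses(STATEMENTS)
for name, (r, w) in counts.items():
    print(f"{name} ({r},{w})")
```

For this example the totals agree with the final annotations on the slide, for instance a (1,2) and p (3,0).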
Static Analysis
• Clustering based on an unsupervised machine learning algorithm
• Classification into 4 classes, then into 2 partitions
• Read-intensive data is allocated to the STT-RAM partition
• Write-intensive data is allocated to the SRAM partition
Classes
• Low Read – Low Write
• Low Read – High Write
• High Read – Low Write
• High Read – High Write
[Figure: example variables split between the two partitions]
Write intensive (SRAM): c (0,1), d (0,1), b (1,2), a (1,2)
Read intensive (STT-RAM): q (3,0), p (3,0)
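The two-stage decision (four classes, then two partitions) can be illustrated with a small Python sketch. It swaps in fixed thresholds for the unsupervised clustering, which the slides do not specify. The threshold values are assumptions, and so is the tie-break for the two mixed classes (more writes than reads goes to SRAM).

```python
# (read count, write count) per variable, taken from the slide's example.
COUNTS = {"a": (1, 2), "b": (1, 2), "c": (0, 1), "d": (0, 1), "p": (3, 0), "q": (3, 0)}


def classify(reads, writes, read_threshold=2, write_threshold=2):
    """Stage 1: map access counts to one of the four classes.

    The thresholds are hypothetical stand-ins for the clustering step.
    """
    read_level = "High Read" if reads >= read_threshold else "Low Read"
    write_level = "High Write" if writes >= write_threshold else "Low Write"
    return f"{read_level} - {write_level}"


def partition(reads, writes, access_class):
    """Stage 2: map a class to a cache partition."""
    if access_class == "Low Read - High Write":
        return "SRAM"       # write intensive
    if access_class == "High Read - Low Write":
        return "STT-RAM"    # read intensive
    # Assumed tie-break for Low-Low and High-High: follow the dominant access type.
    return "SRAM" if writes > reads else "STT-RAM"


placement = {}
for name, (r, w) in COUNTS.items():
    cls = classify(r, w)
    placement[name] = (cls, partition(r, w, cls))

for name, (cls, part) in placement.items():
    print(f"{name}: {cls} -> {part}")
```

With these assumed thresholds, a, b, c and d land in SRAM and p and q in STT-RAM, which matches the grouping shown on the slide.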
Memory Access Types
[Figure: share of variables in each access class per benchmark. x-axis: SPEC CPU2006 benchmarks (400.perlbench, 401.bzip2, 403.gcc, 410.bwaves, 416.gamess, 429.mcf, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 444.namd, 445.gobmk, 447.dealII, 450.soplex, 453.povray, 454.calculix, 456.hmmer, 458.sjeng, 459.GemsFDTD, 462.libquantum, 464.h264ref, 465.tonto, 470.lbm, 471.omnetpp, 473.astar, 481.wrf, 482.sphinx3, 483.xalancbmk). y-axis: number of variables (%), 0% to 100%. Series: Low Read - Low Write, Low Read - High Write, High Read - Low Write, High Read - High Write]
Variables show high read OR write affinity
Dynamic Memory
• Hard to analyze
• Exposed to programmer
• Dynamic memory library support
– Enable dual heap structure
• Two distinct system calls (r_malloc, w_malloc)
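A dual heap behind the two calls might look like the following Python model. Only the names `r_malloc` and `w_malloc` come from the slide. The bump allocator, the base addresses, the heap sizes and the 8-byte alignment are all assumptions made for illustration.

```python
class BumpHeap:
    """A minimal bump allocator over one contiguous address range."""

    def __init__(self, name, base, size):
        self.name = name
        self.base = base
        self.limit = base + size   # exclusive upper bound
        self.next_free = base

    def alloc(self, nbytes, align=8):
        """Return the start address of a new block, or None if it does not fit."""
        start = (self.next_free + align - 1) // align * align
        if nbytes <= 0 or start + nbytes > self.limit:
            return None
        self.next_free = start + nbytes
        return start

    def owns(self, addr):
        return self.base <= addr < self.limit


# Hypothetical address ranges: one heap per access type.
READ_HEAP = BumpHeap("read-intensive heap", 0x1000_0000, 1 << 20)
WRITE_HEAP = BumpHeap("write-intensive heap", 0x2000_0000, 1 << 20)


def r_malloc(nbytes):
    """Allocate data the programmer expects to be read-intensive."""
    return READ_HEAP.alloc(nbytes)


def w_malloc(nbytes):
    """Allocate data the programmer expects to be write-intensive."""
    return WRITE_HEAP.alloc(nbytes)


lookup_table = r_malloc(4096)   # mostly read after initialization
event_log = w_malloc(100)       # appended to frequently
counter = w_malloc(3)

print(hex(lookup_table), hex(event_log), hex(counter))
```

Because each call draws from its own address range, the heap an object lives in can be recovered from its address alone, which is what lets a later stage steer it to one partition or the other.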
EnVM Layout
[Figure: existing virtual memory layout vs. proposed virtual memory layout]
• x86 segment registers do boundary checking
• Minimum modification to fit other architectures
• Data from each segment is allocated to either STT-RAM or SRAM
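The segment-to-partition mapping can be pictured as a bounds lookup. The Python sketch below is a software model only: the segment names and address ranges are hypothetical, and on x86 the bounds check would be done by the segment hardware rather than by a loop.

```python
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    name: str
    base: int
    limit: int       # exclusive upper bound
    partition: str   # cache partition that backs this segment


# Hypothetical layout: each data segment is tied to one partition.
SEGMENTS = [
    Segment("read-intensive data", 0x1000_0000, 0x1800_0000, "STT-RAM"),
    Segment("write-intensive data", 0x1800_0000, 0x2000_0000, "SRAM"),
]


def partition_for(addr: int) -> Optional[str]:
    """Return the partition for addr, or None if it is outside every segment."""
    for seg in SEGMENTS:
        if seg.base <= addr < seg.limit:   # models the segment bounds check
            return seg.partition
    return None


print(partition_for(0x1000_0040))   # STT-RAM
print(partition_for(0x1800_0000))   # SRAM
print(partition_for(0x3000_0000))   # None
```

Deciding placement from the address alone means no per-line metadata and no data migration between partitions, in line with the migration-less design the slides advocate.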
Evaluation
• Comparison
– Hardware only method (HW) on hybrid L1
– Software method based on stack layout (SW1) on hybrid L1
– Software method based on reuse distance (SW2) on hybrid L2
– Our method on hybrid L1 (EnVM)
Simulator: MARSSx86 cycle-accurate simulator
Processor: Unicore, 3 GHz, commit width 4
Memory, hybrid L1 design
– L1 I-Cache (SRAM): 64K, 64B line, 3 cycles
– L1 D-Cache (Hybrid), SRAM: 4K, 4-way, 3 cycles
– L1 D-Cache (Hybrid), STT-RAM: 64K, 4-way, read 3 cycles, write 10 cycles
– L2 (SRAM): 2M, 8-way, 15 cycles, 64B lines
Memory, hybrid L2 design
– L1 I-Cache (SRAM): 64K, 8-way, 3 cycles, 64B line
– L1 D-Cache (SRAM): 32K, 8-way, 3 cycles, 64B line
– L2 (Hybrid), SRAM: 1M, 4-way, 3 cycles
– L2 (Hybrid), STT-RAM: 2M, 8-way, read 11 cycles, write 30 cycles
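To see why steering writes away from STT-RAM matters with these latencies, a back-of-the-envelope average can be computed from the hybrid L1 figures above. The sketch below is not part of the evaluation: the access mixes are invented for illustration, and it counts hit latency only, ignoring misses and energy.

```python
# Hit latencies in cycles, from the hybrid L1 D-cache configuration.
SRAM_CYCLES = 3
STT_READ_CYCLES = 3
STT_WRITE_CYCLES = 10


def average_hit_latency(sram_accesses, stt_reads, stt_writes):
    """Average hit latency in cycles over a given mix of accesses."""
    total = sram_accesses + stt_reads + stt_writes
    if total == 0:
        raise ValueError("at least one access is required")
    cycles = (sram_accesses * SRAM_CYCLES
              + stt_reads * STT_READ_CYCLES
              + stt_writes * STT_WRITE_CYCLES)
    return cycles / total


# Hypothetical mixes per 1000 accesses: the second one redirects
# 150 writes from the STT-RAM partition to the SRAM partition.
before = average_hit_latency(sram_accesses=200, stt_reads=600, stt_writes=200)
after = average_hit_latency(sram_accesses=350, stt_reads=600, stt_writes=50)

print(f"before: {before:.2f} cycles, after: {after:.2f} cycles")
```

Under this invented mix the average falls from 4.40 to 3.35 cycles, since every redirected write costs 3 cycles instead of 10.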
Write Reduction
• Normalized to HW
• EnVM reduces writes by 47.6% relative to HW and by 15% relative to SW1
[Figure: normalized number of writes to STT-RAM per benchmark. x-axis: SPEC CPU2006 integer and floating-point benchmarks, with INT AVG, FP AVG and AVERAGE. y-axis: normalized number of writes to STT-RAM, 0 to 1. Series: SW2, EnVM, SW1]
Energy Savings
[Figure: normalized energy per instruction per benchmark. x-axis: SPEC CPU2006 integer and floating-point benchmarks, with INT AVG, FP AVG and TOTAL AVG. y-axis: normalized energy per instruction, 0 to 1.4. Series: SW2, EnVM, SW1, HW]
• Normalized to pure SRAM configuration
• Maximum energy reduction: 50%, for 458.sjeng
• EnVM reduces energy by 21% relative to HW and by 6% relative to SW1
Performance Impact
[Figure: normalized IPC per benchmark. x-axis: SPEC CPU2006 integer and floating-point benchmarks, with INT AVG, FP AVG and AVERAGE. y-axis: normalized IPC, 0.3 to 1.2. Series: HW, SW1, SW2, EnVM]
• Normalized to pure SRAM configuration
• Comparable IPC
• Write latency is offset by bigger cache capacities
Summary
• Holistic management of process memory to aid the hybrid memory hierarchy
• Reduces writes by 47.6% (vs. HW) and 15% (vs. SW1)
• Reduces energy by 21% (vs. HW) and 6% (vs. SW1)
• Minimum hardware modification
• No profiling of applications
• No migration of data
• Improvements
– Dynamic memory management
Thank You