
Maciej Maciejewski, Krzysztof Czuryło

LinuxCon, Berlin ’16

Salvador Dalí, The Persistence of Memory


Establishing the Open Industry NVM Programming Model

36+ Member Companies

http://snia.org/sites/default/files/NVMProgrammingModel_v1.pdf

SNIA Technical Working Group defined the 4 programming modes required by developers.

Spec 1.0 developed, approved by SNIA voting members, and published:

- Interfaces for a PM-aware file system accessing kernel PM support
- Interfaces for applications accessing a PM-aware file system
- Kernel support for block NVM extensions
- Interfaces for legacy applications to access block NVM extensions


Persistent memory programming model


Diagram: the persistent memory programming model. In user space, applications and a management UI/library sit on top of the operating system; in kernel space, the NVDIMM driver and file systems sit on top of the NVDIMM hardware. Three access paths are shown: (1) the standard file API through a conventional file system and the NVDIMM driver, (2) standard raw device access directly through the NVDIMM driver, and (3) the standard file API of a pmem-aware file system, which establishes MMU mappings so the application can reach persistent memory with plain loads and stores.

ACPI NFIT (NVDIMM Firmware Interface Table)


E820 (legacy BIOS memory map)


What can we do with it?


Memory


Memory - libvmmalloc


Diagram: libvmmalloc. The library constructor creates a temporary file on persistent memory and mmaps it at startup; the application's ordinary malloc() calls are then transparently redirected (through jemalloc and the libvmem pool, vmem_pool_create/vmem_pool_malloc) to that persistent-memory pool instead of DRAM.

Memory - libvmem


Diagram: libvmem. The application calls vmem_create() to set up a volatile memory pool backed by a temporary file on persistent memory (fallocate/mmap), then vmem_malloc() to allocate from that pool through jemalloc (3.6.0); ordinary malloc() from libc (or another allocator) keeps using DRAM via mmap/sbrk.
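For flavour, a minimal libvmem sketch follows; it assumes a pmem-aware file system mounted at /mnt/pmem (the path is illustrative) and linking with -lvmem.

#include <libvmem.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Create a volatile pool backed by a temporary file in /mnt/pmem. */
    VMEM *vmp = vmem_create("/mnt/pmem", VMEM_MIN_POOL);
    if (vmp == NULL) {
        perror("vmem_create");
        return 1;
    }

    /* Allocations now come from persistent memory used as volatile memory. */
    char *buf = vmem_malloc(vmp, 128);
    if (buf == NULL) {
        perror("vmem_malloc");
        return 1;
    }
    strcpy(buf, "allocated on pmem, no persistence guarantees");
    puts(buf);

    vmem_free(vmp, buf);
    vmem_delete(vmp);
    return 0;
}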

Memory - https://github.com/memkind/memkind


Block storage


Byte persistence


Data persistence


Atomicity


Flushing to Persistence

open(…);
mmap(…);
strcpy(pmem, "Hello");
msync(pmem, 6, MS_SYNC);
pmem_persist(pmem, 6);

(msync(2) and pmem_persist() are two alternative ways of flushing the store to persistence.)

Crossing the 8-byte store:

strcpy(pmem, "Hello, World!");
pmem_persist(pmem, 14);

What can the string contain if a crash hits before pmem_persist() completes?

1. "\0\0\0\0\0\0\0\0\0\0..."
2. "Hello, W\0\0\0\0\0\0..."
3. "\0\0\0\0\0\0\0\0orld!\0"
4. "Hello, \0\0\0\0\0\0\0\0"
5. "Hello, World!\0"
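A self-contained sketch of this pattern with libpmem is shown below; the file path /mnt/pmem/hello and the mapping size are illustrative assumptions, not taken from the talk.

#include <libpmem.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAP_SIZE 4096

int main(void)
{
    int fd = open("/mnt/pmem/hello", O_CREAT | O_RDWR, 0666);
    if (fd < 0 || posix_fallocate(fd, 0, MAP_SIZE) != 0) {
        perror("open/fallocate");
        return 1;
    }

    char *pmem = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (pmem == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    close(fd);

    strcpy(pmem, "Hello, World!");

    /* On real pmem, flush CPU caches from user space; otherwise fall back to msync(). */
    if (pmem_is_pmem(pmem, MAP_SIZE))
        pmem_persist(pmem, 14);
    else
        pmem_msync(pmem, 14);

    munmap(pmem, MAP_SIZE);
    return 0;
}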

Location

files/pools

mmap(2)

allocation mechanism

bookkeeping

replication & recovery


Diagram: where NVML fits. The application links against NVML in user space; a file on a PM-aware file system is opened through the standard file API, the file system and the NVDIMM driver establish MMU mappings to persistent memory, and the application then accesses it directly with loads and stores.

http://pmem.io

https://github.com/pmem/nvml/

NVML


nvml Persistent Libraries

libpmem – Basic persistence handling

libpmemblk – Block access to persistent memory

libpmemlog – Log file on persistent memory (append-mostly)

libpmemobj – Transactional object store on persistent memory

libpmempool – Pool management utilities

librpmem – Replication (remote)


https://github.com/pmem/nvml/tree/master/src/examples

http://pmem.io/blog/
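The examples tree linked above covers each library; as a taste of the simplest ones, here is a hedged libpmemlog sketch (append one record, then walk the log). The pool path /mnt/pmem/logfile is an illustrative assumption.

#include <libpmemlog.h>
#include <stdio.h>
#include <string.h>

static int print_chunk(const void *buf, size_t len, void *arg)
{
    fwrite(buf, 1, len, stdout);   /* dump the log contents */
    return 0;                      /* stop after this chunk; one pass is enough */
}

int main(void)
{
    PMEMlogpool *plp = pmemlog_create("/mnt/pmem/logfile",
                                      PMEMLOG_MIN_POOL, 0666);
    if (plp == NULL) {
        perror("pmemlog_create");
        return 1;
    }

    const char *msg = "persistent log entry\n";
    if (pmemlog_append(plp, msg, strlen(msg)) < 0)
        perror("pmemlog_append");

    /* Walk the whole log in a single chunk (chunksize == 0). */
    pmemlog_walk(plp, 0, print_chunk, NULL);

    pmemlog_close(plp);
    return 0;
}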

Applications


Modifying application allocations

Which objects to store in PM?

How to decide whether it is better to allocate/store in HBM, DRAM, PM, SSD, or HDD?

Do all of them need to be persistent?

When to guarantee persistence?


Modifying application engine

for (int i = 0; i < NUMBER_OF_ITERATIONS; i++) {
    result = calculateThis(i, result);
}


i_pm = 0;   /* at the first run of the application only */
...
for (; i_pm < NUMBER_OF_ITERATIONS; i_pm++) {
    TX_BEGIN(pool) {
        result_pm = calculateThis(i_pm, result_pm);
    } TX_END
}
i_pm = 0;
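For context, a hedged sketch of how such a restartable loop could look with libpmemobj follows; the pool path, layout name, root structure, and the stand-in computation are assumptions, not the talk's actual code.

#include <libpmemobj.h>
#include <stdio.h>

struct my_root {        /* state that must survive a restart */
    int i;
    double result;
};

int main(void)
{
    PMEMobjpool *pop = pmemobj_create("/mnt/pmem/loop.pool", "loop_layout",
                                      PMEMOBJ_MIN_POOL, 0666);
    if (pop == NULL) {
        /* Pool already exists: reopen it and continue where we left off. */
        pop = pmemobj_open("/mnt/pmem/loop.pool", "loop_layout");
        if (pop == NULL) {
            perror("pmemobj_create/open");
            return 1;
        }
    }

    /* The root object is zero-initialized on first allocation, so i starts at 0. */
    PMEMoid root = pmemobj_root(pop, sizeof(struct my_root));
    struct my_root *rp = pmemobj_direct(root);

    while (rp->i < 1000) {
        TX_BEGIN(pop) {
            /* Snapshot the whole root object so the counter and the result
             * roll back together if the transaction aborts or we crash. */
            pmemobj_tx_add_range(root, 0, sizeof(struct my_root));
            rp->result += rp->i;   /* stands in for calculateThis() */
            rp->i++;               /* advance the persistent counter atomically */
        } TX_END
    }

    printf("result = %f\n", rp->result);
    pmemobj_close(pop);
    return 0;
}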

Redis key/value store


*All of the following results come from an older developer machine, with persistent memory emulated on DDR3.


Chart: Startup time [seconds] vs. object size (32-8192 B); series: RDB, AOF, PM.


Chart: DRAM usage. DRAM allocations [MB] vs. object size (32-8192 B); series: AOF, RDB, PM.


Chart: Running out of DRAM. Operations per second vs. % of OS memory used (36-141); series added one by one across the slides: No Persist, RDB, AOF, PM.


Chart: Redis, Transactional API, for object sizes 32-8192 B; series added across the slides: pmem, AOF, then PM 4x, AOF 4x, and PM 10x, AOF 10x.


Chart: Total data throughput. Bytes per second (billions) vs. object size [B] (0-9000); series: pmem, AOF, pmem 10x, AOF 10x.

PETSc

Portable, Extensible Toolkit for Scientific Computation

A suite of data structures and routines developed by Argonne National Laboratory for the scalable (parallel) solution of scientific applications modeled by partial differential equations.

It employs the Message Passing Interface (MPI) standard for all message-passing communication.

PETSc is the world’s most widely used parallel numerical software library for partial differential equations and sparse matrix computations, with over 760 publications using it.

PETSc includes a large suite of parallel linear and nonlinear equation solvers that are easily used in application codes written in C, C++, Fortran and now Python.

PETSc provides mechanisms needed within parallel application code that allow the overlap of communication and computation.

https://www.mcs.anl.gov/petsc/


Sparse Matrix multiplication

                    Original                 Persistent Memory
Time (sec)          2.070e+03                2.109e+03
Memory              1.361e+10                1.048e+06

                    Count  Time (sec)        Count  Time (sec)
MatAssemblyBegin    4      1.4782e-05        2      7.1526e-06
MatAssemblyEnd      4      2.8858e+00        2      2.6408e+00
MatLoad             2      1.6146e+01        2      3.6855e-03
MatMatMultSym       1      9.0468e+02        1      9.1570e+02
MatMatMultNum       1      1.1468e+03        1      1.1929e+03

Compute 2% slower

Preparation 680% faster


Sparse matrix solver (Time [sec])

                    Original      Matrices in PM   All in PM
MatAssemblyBegin    2.6226e-06    1.6443e-02       1.6618e-02
MatAssemblyEnd      1.1275e-01
MatLoad             1.7201e+00
MatMult             9.3791e-02    1.1111e-01       1.1177e-01
VecSet              5.0273e-03    4.6258e-03       9.6970e-03
MatMult             2.6623e+01    2.8851e+01       2.9268e+01
MatView             9.5606e-05    9.3222e-05       1.0443e-04
VecMDot             7.0743e+00    7.0108e+00       8.3750e+00
VecNorm             6.4105e-01    6.4236e-01       6.4869e-01
VecScale            7.0098e-01    7.0180e-01       7.0580e-01
VecCopy             1.5669e-02    1.5939e-02       1.5724e-02
VecSet              4.0993e-02    4.0953e-02       8.0090e-02
VecAXPY             6.5127e-02    6.6045e-02       6.8618e-02
VecMAXPY            1.0328e+01    1.0488e+01       8.7134e+00
VecPointwiseMult    1.1661e+00    1.2202e+00       1.3907e+00
VecNormalize        1.3422e+00    1.3443e+00       1.3546e+00
KSPGMRESOrthog      1.6762e+01    1.6844e+01       1.6556e+01
KSPSetUp            4.5981e-03    4.6084e-03       2.6381e+00
KSPSolve            4.6762e+01    4.9142e+01       6.7208e+01
PCSetUp             2.6226e-06    2.3842e-06       2.8610e-06
PCApply             1.2530e+00    1.3039e+00       1.4794e+00

Stage 1: File loading

Stage 2: Vector duplication and multiplication

Stage 3: Solver stage

GMRES Sparse Matrix solver

                    Time [s]   Memory [MB]
Original            48.67      557
Matrix data in PM   49.33      260
All in PM           74.89      7


No universal recipe

Different data usage scenarios

A lot of architectural work

Even more coding

Image credit: Uwe Kils, http://www.ecoscope.com/iceberg/

SNIA programming models

NVML

HW

RDMA

RAS

Language bindings

Replication

OS address space limit

OS boot-up / hibernation

POSIX

Wear leveling

JVM memory management

Virtualization

Space management

TLB
