NVthreads: Practical Persistence for Multi-threaded Applications
Terry Hsu*, Purdue University Helge Brügner*, TU München Indrajit Roy*, Google Inc. Kimberly Keeton, Hewlett Packard Labs Patrick Eugster, TU Darmstadt and Purdue University * Work was done at Hewlett Packard Labs.
NVMW 2018
❖ NVthreads was published in EuroSys 2017 ❖ This work was supported by Hewlett Packard Labs, NSF TC-1117065, NSF TWC-1421910, and ERC FP7-617805.
What is non-volatile memory (NVM)?
• Key features: persistence, good performance, byte addressability
• Persistence
- Retains data without power
• Good performance
- Outperforms the traditional file-system interface
• Byte addressability
- Allows pure memory (load/store) operations
Programming interfaces for NVM
• NVM-aware file systems: BPFS, PMFS, PMEM
- Pro: provide good performance
- Con: require applications to use file-system interfaces and may need hardware modifications
• Durable transactions and heaps: NV-Heaps, Mnemosyne
- Pro: allow fine-grained NVM access
- Con: force programs to use transactions and require non-trivial effort to retrofit transactions into lock-based programs
☞Problem: Can we provide a simpler programming interface?
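NVM-aware apps programming
[Figure: a persistent list in NVM, head (value 1) → e (value 5) → NULL, with its tail pointer; later builds add a volatile cache between the program and NVM, with cache lines "flushing…" to NVM.]
Challenges: 1. data consistency 2. programmability 3. volatile caches 4. performance

/* Add element e to the tail of the list (the list holds only head, so tail == head) */
pthread_mutex_lock(&m);
e = malloc(sizeof(*e));
e->value = 5;
e->next = NULL;
tail->next = e;   /* crash here: e is linked in, but tail is not yet updated */
tail = e;
pthread_mutex_unlock(&m);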
Challenges of using NVM
• Data consistency
- Ensure data consistency even after a crash
• Volatile caches
- Manage data movement from volatile caches to NVM
• Programmability
- Avoid extensive program modifications
• Performance
- Minimize runtime overhead
Proposal: NVthreads, a programming model and runtime that adds persistence to multi-threaded C/C++ programs
Goals of NVthreads
• Make existing lock-based C/C++ applications crash tolerant
• Minimize porting effort
- Drop-in replacement for the pthreads library
- No need for transactions
• Advantages of NVthreads
- Good performance
- Easier development of NVM-aware applications
Key ideas
• Use synchronization points to infer consistent regions (cf. Atlas [OOPSLA’14])
- Does not require applications to use transactions
• Execute the multi-threaded program as a multi-process program (cf. DThreads [SOSP’11])
- Each process's private memory buffers its uncommitted writes
• Track data modifications at page granularity
- Amortizes logging overhead compared with fine-grained tracking
Using NVthreads
• Ease of use:
bash$ gcc foo.c -o foo.out -rdynamic libnvthread.so -ldl
• Modifications: add recovery code and specify persistent allocations
- Allocate data in NVM: nvmalloc()
- Recover data in NVM: nvrecover()
[Figure: NVthreads software stack. User space: the unmodified C/C++ application, linked to the NVthreads library (multi-process execution, intercepting synchronization, tracking data, maintaining the log). Kernel space: the operating system (memory allocation and file-system interface for both DRAM and NVM). Hardware: DRAM (volatile main memory, e.g., stacks) and NVM (persistent regions, e.g., a linked list on the heap).]
NVthreads: programming model
int main() {
    if (crashed()) {
        int *c = (int *) nvmalloc(sizeof(int), "c");
        *c = nvrecover(c, sizeof(int), "c");
    } else {  // normal execution
        int *c = (int *) nvmalloc(sizeof(int), "c");
        ...   // thread creation
        m.lock();
        *c = *c + 1;
        ...
        m.unlock();
    }
    return 0;
}
• Locks mark the boundary of a durable code section.
• Application-specific recovery code (the if (crashed()) branch): the programmer needs to add it.
Example: linked list
• NVthreads guarantees that the linked list is appended atomically with respect to failures
/* L is a persistent list */
/* Start threads T1, T2, T3 */
…
/* Add element to the tail of list */
pthread_mutex_lock(&m);
nvmalloc(&e, sizeof(*e));
e->val = localVal;
tail->next = e;
e->prev = tail;  /* crash! */
tail = e;
pthread_mutex_unlock(&m);
[Figure: T1, T2, and T3 each run a critical section that adds e1, e2, and e3 in turn. The state of the list data structure "L" in NVM goes from L={} to L={e1} to L={e1, e2}; the crash at e->prev = tail is followed by a recovery phase that executes redo ops.]
Implementing atomic durability
• Convert threads to processes (cf. DThreads [SOSP’11])
- Each process works on private memory, so no undo log is needed
• At synchronization points, propagate private updates and execute processes sequentially
• Track dirty pages and log them to NVM for recovery
- Apply the redo log in the event of a crash
From threads to processes
[Figure: threads sharing one address space become processes with disjoint address spaces. T1 and T2 run parallel phases; at a critical section, the process holding the token executes while the other waits, then passes the token. Labels inside each critical section: Start, NVM log write, Merge shared state, Track dirty pages, Stop.]
Redo logging
[Figure: T1 reaches sync(); it logs its dirty pages to the redo log in NVM, merges the updated bytes into the shared state, and writes back to NVM. Legend: parallel phase, critical section, clean page, dirtied page.]
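Tracking data dependencies
[Figure: with X = Y = 0 initially, T1 sets X = 1 and calls cond_signal(); T2 returns from cond_wait() and sets Y = X. This creates a dependence between the threads' logs (Log1, Log2, Log3).]
• NVthreads maintains metadata for memory pages per lockset to track data dependencies.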
Evaluation
• Environment
- Ubuntu 14.04 (Linux 3.16.7)
- Two Intel Xeon X5650 processors
- 198GB RAM and 600GB SSD
• Applications
- PARSEC benchmarks, Phoenix benchmarks, PageRank, K-means
• NVM emulator
- Linux tmpfs on DRAM emulating nvmfs (provided by Hewlett Packard Labs)
- A 1000 ns delay injected into each 4 KB page write, using the RDTSCP instruction
Performance vs pthreads
• Phoenix and PARSEC benchmarks
• No recovery protocol
• 9 out of 14 applications: NVthreads incurs less than 20% overhead vs pthreads
• Remaining 5 applications: 4x to 7x slowdown vs pthreads
[Figure: slowdown (x), 0 to 16, for histogram, kmeans, linear regression, matrix multiply, pca, reverse index, string match, word count, blackscholes, canneal, dedup, ferret, streamcluster, and swaptions; series: Pthreads, Dthreads, NVthreads (nvmfs 1000ns), Atlas.]
Performance vs Atlas [OOPSLA’14]
• 10 out of 12 applications: NVthreads is 7% to 100x faster than Atlas
• Remaining 2 applications: NVthreads is 7% to 2x slower than Atlas
[Figure: the same slowdown chart as above; two bars exceed the 16x axis and are labeled 101.96 and 46.92.]
Is coarse-grained tracking a good fit?
• 9 out of 14 applications touch more than 55% of each page
• It is worthwhile to track data at page granularity in these apps
[Figure: % of each page modified (0 to 100) per application, with the number shown in parentheses after each name: linear regression (25), string match (37), histogram (44), blackscholes (89), swaptions (483), matrix multiply (4K), kmeans (10K), pca (11K), word count (12K), ferret (150K), streamcluster (180K), dedup (2.3M), reverse index (2.7M), canneal (7.4M).]
NVthreads is faster than fine-grained tracking
• Microbenchmark: 4 threads randomly modify parts of 1000 memory pages
• Mnemosyne [ASPLOS’11] and Atlas [OOPSLA’14] use word-level tracking
• NVthreads is 3x to 30x faster than fine-grained tracking
[Figure: slowdown over pthreads (x), 0 to 250, versus percentage of page modified (5%, 10%, 25%, 50%, 75%, 100%); series: NVthreads (nvm-1000ns), Atlas (no-clflush), Mnemosyne, Atlas.]
Benefits of recovery (K-means)
• We made K-means crash at synthetic program points, recover, and continue until convergence at around the 160th iteration
• NVthreads' K-means provides up to 1.9x speedup vs pthreads
• NVthreads requires only 4 SLOC changes to make K-means crash tolerant
[Figure: speedup over pthreads (x), 0 to 2, for input sizes 1M, 10M, 20M, and 30M, grouped by the iteration when the crash occurred (10, 50, 75, 150); series: Pthreads, NVthreads (nvm=1000ns).]
Summary
• NVthreads allows programmers to easily leverage NVM with just a few lines of source code changes
• Recovery requires only a redo log because multi-process execution buffers private updates
• Coarse-grained page-level tracking amortizes logging overheads
• The NVthreads prototype is publicly available at:
https://github.com/HewlettPackard/nvthreads