File systems for persistent memory
CS 839 - Persistence
Questions on homework?
• Can we shift the schedule and do BPFS on Thursday and Nova on Monday? Drop Aerie or SplitFS.
Learning outcomes
• Understand how disk-based file systems update metadata and handle consistency
• Understand the properties of NVM that can change file system design
• Understand the key ordering requirements for file systems
• Understand BPFS software and hardware mechanisms and their limitations
Background story
• PCM is becoming popular, first for main memory
• Obvious approach seems to be use it for file systems too
• Question: how do you optimize?
Background: normal file systems
• Use page cache to buffer data in DRAM
• Access SSD through block layer
• Use logging for consistency
Background: FS data structures
• Standard FS data structures
• Superblock: describes FS parameters, location of root inode
• Inode: metadata for a single file
  • Attributes, size, location of data blocks
  • Number
• Data block: holds file or directory contents
• Directory entry: string name and inode number
• Inode and data block bitmaps: track free/used locations on storage
• Indirect block: locations of other data blocks or indirect blocks
Background: FS consistency
• What gets updated when appending to a file?
• Allocate block from data bitmap
• Write data to data block
• Write block address to inode or indirect block
• Update file length & modification time in inode
• What happens if system crashes in the middle?
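The append steps above can be sketched in C. This is a deliberately simplified, hypothetical FS (structure names and sizes are invented for illustration); each numbered step is a separate write to storage, and a crash between any two of them leaves the metadata inconsistent, e.g. an allocated block that no inode references.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified on-"disk" structures for illustration only. */
#define BLOCK_SIZE 4096
#define NBLOCKS    64
#define NDIRECT    12

struct inode {
    uint64_t size;            /* file length in bytes */
    uint64_t mtime;           /* modification time */
    uint32_t direct[NDIRECT]; /* data block addresses */
};

static uint8_t data_bitmap[NBLOCKS];      /* 1 = in use */
static uint8_t blocks[NBLOCKS][BLOCK_SIZE];

/* Append one block; a crash between any two steps is the consistency
 * problem journaling or shadow updates must solve. */
int append_block(struct inode *ino, const uint8_t *buf, uint64_t now)
{
    if (ino->size / BLOCK_SIZE >= NDIRECT)
        return -1;                         /* sketch: no indirect blocks */

    /* 1. Allocate a block from the data bitmap. */
    uint32_t b;
    for (b = 0; b < NBLOCKS && data_bitmap[b]; b++) ;
    if (b == NBLOCKS) return -1;
    data_bitmap[b] = 1;

    /* 2. Write the data into the new block. */
    memcpy(blocks[b], buf, BLOCK_SIZE);

    /* 3. Record the block address in the inode. */
    ino->direct[ino->size / BLOCK_SIZE] = b;

    /* 4. Update file length and modification time. */
    ino->size += BLOCK_SIZE;
    ino->mtime = now;
    return 0;
}
```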
Background: FS consistency mechanisms
• Journaling: write metadata (and/or data) to a journal before writing it in place – redo logging
• Write journal, force to storage
• Later checkpoint – write metadata/data in real place
• Can skip data journaling for performance
• Shadow updates: write data/metadata updates to a new location (used in BPFS)
• Basically copy-on-write data structures
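The redo-logging discipline above can be sketched as follows. This is an assumed, minimal model (the record format, `flush_to_storage` stand-in, and function names are not from any real FS): updates go to the journal and are forced durable, a commit record is written, and only then are they checkpointed in place. Replaying a committed journal is idempotent, so a crash during checkpoint is harmless.

```c
#include <stdint.h>

#define JSLOTS 16

struct jrec { uint64_t dest; uint64_t val; };  /* redo record: new value */

static struct jrec journal[JSLOTS];
static int      jhead;
static int      jcommitted;          /* set only after a forced flush */
static uint64_t storage[JSLOTS];     /* the "in-place" metadata locations */

static void flush_to_storage(void) { /* stand-in for a disk cache flush */ }

void journal_write(uint64_t dest, uint64_t val)
{
    journal[jhead].dest = dest;
    journal[jhead].val  = val;
    jhead++;
}

void journal_commit(void)
{
    flush_to_storage();              /* journal records durable first */
    jcommitted = 1;                  /* then the commit record */
    flush_to_storage();
}

/* Checkpoint: replay the journal into the real locations. After a crash,
 * an uncommitted journal is simply discarded. */
void journal_checkpoint(void)
{
    if (!jcommitted) return;
    for (int i = 0; i < jhead; i++)
        storage[journal[i].dest] = journal[i].val;
    flush_to_storage();
    jhead = 0;
    jcommitted = 0;                  /* journal space can be reused */
}
```

Note that every journaled byte is written twice (journal, then in place), which is exactly the overhead the "Review 1" slide calls out.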
Review 1: Journaling
• Write to journal, then write to file system
[Figure: blocks A and B are first written to the journal as A’ and B’, then checkpointed into the file system in place]
• Reliable, but all data is written twice
Review 2: Shadow Paging
• Use copy-on-write up to root of file system
[Figure: updating blocks A and B copies them to A’ and B’ and propagates new pointers all the way up to the file’s root pointer]
• Any change requires bubbling to the FS root
• Small writes require large copying overhead
Atomicity requirements
• What happens when you crash while writing data to a file?
1. Entire write takes place or none takes place
2. Some blocks may be written entirely but not all
3. Arbitrary bytes of file may be replaced
• What do normal file systems do?
• “torn write” – partially written block
• Data vs metadata journaling
Basic idea: RAM disk
• Idea 1: RAM disk
• Make a block device that accesses NVM instead of going to a device
• BTT: block translation table, uses shadow updates to allow atomic block-sized writes
• Problems:
• Still copies data to DRAM – inefficient
• All writes are block sized – inefficient
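The BTT idea can be sketched as below. This is a simplified model, not the real libnvdimm BTT layout (the names, table sizes, and 32-bit map entries are assumptions): each logical block reads through an indirection table, a write lands in a free physical block, and a single atomic map-entry update commits it, so a torn block write can never be observed.

```c
#include <stdint.h>
#include <string.h>

#define BSZ   4096
#define NLOG  4          /* logical blocks */
#define NPHYS 6          /* physical blocks; the extras form a free pool */

static uint8_t  phys[NPHYS][BSZ];
static uint32_t map[NLOG] = {0, 1, 2, 3};     /* logical -> physical */
static uint32_t free_list[NPHYS - NLOG] = {4, 5};
static int      nfree = 2;

/* Shadow update: write out of place, then commit with one atomic
 * map-entry store. A crash before the store leaves the old block mapped. */
void btt_write(uint32_t lba, const uint8_t *buf)
{
    uint32_t newp = free_list[--nfree];   /* pick a free physical block */
    memcpy(phys[newp], buf, BSZ);         /* write data out of place */
    /* ...flush + fence here on real hardware... */
    uint32_t oldp = map[lba];
    map[lba] = newp;                      /* atomic swap is the commit */
    free_list[nfree++] = oldp;            /* old block becomes free */
}

const uint8_t *btt_read(uint32_t lba) { return phys[map[lba]]; }
```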
What changes with NVM/SCM/PMem?
• Fine-grained writes
• Don’t have to write entire blocks when updating a single value
• Fast random access
• Don’t need to optimize metadata for sequential extents
• No buffering
• Can serve data directly from memory
• But:
• Loss of ordering
Short-Circuit Shadow Paging
• Uses byte-addressability and atomic 64-bit writes
[Figure: copy-on-write of blocks A and B stops partway up the tree; the update commits with one atomic pointer write instead of bubbling to the file’s root pointer]
• Inspired by shadow paging
– Optimization: in-place update when possible
Opt. 1: In-Place Writes
• Aligned 64-bit writes are performed in place
• Data and metadata
[Figure: an in-place write modifies a data block directly, with no pointer updates above it]
• Appends committed by updating file size
[Figure: an in-place append writes past end-of-file, then commits with an atomic update of the file’s root pointer + size]
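The in-place append can be sketched as below. This is a simplified model (BPFS actually commits through the file's root pointer + size packed into one word; here the commit is a lone 8-byte size field, and the flush/fence is elided): data past end-of-file is written in place first, invisible to readers, and a single atomic 64-bit store of the new size makes it visible.

```c
#include <stdint.h>
#include <string.h>

#define CAP 65536

/* Simplified persistent file: the 8-byte, aligned size field doubles as
 * the commit record for appends. */
struct pm_file {
    uint8_t data[CAP];
    volatile uint64_t size;   /* aligned: hardware updates it atomically */
};

void pm_append(struct pm_file *f, const uint8_t *buf, uint64_t len)
{
    memcpy((uint8_t *)&f->data[f->size], buf, len); /* past EOF: invisible */
    /* ...flush + fence so the data is durable before the commit... */
    f->size += len;           /* atomic 64-bit store is the commit point */
}
```

A crash before the final store leaves the old size, so readers never see a partially appended tail.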
Opt. 2: Exploit Data-Metadata Invariants
BPFS Example
[Figure: BPFS tree of inodes, indirect blocks, and file/directory blocks under the root pointer; adding or removing a directory entry commits with a single atomic pointer update]
• Cross-directory rename bubbles to common ancestor
What happens if you memory-map a file?
• Rely on hardware for 1-word atomic update
➢ CPU cache may reorder writes to NVM
• Breaks “crash-consistent” update protocols
Consistent updates
[Figure: the program issues STORE value = 0xC02, then STORE valid = 1. The write-back cache may evict the valid flag first, leaving NVM with valid = 1 but a stale value (0xDEADBEEF)]
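The hazard on this slide can be sketched as a tiny valid-flag publish protocol (names are illustrative). In program order the code is safe, but nothing orders the two stores' arrival in NVM: a write-back cache may evict `valid` before `value`, so after a crash NVM can hold `valid = 1` with the stale value.

```c
#include <stdint.h>

struct cell {
    uint64_t value;
    uint64_t valid;   /* 1 means value is published */
};

/* Correct in program order, but broken for crash consistency: the cache
 * may write `valid` back to NVM before `value`. */
void unsafe_publish(struct cell *c, uint64_t v)
{
    c->value = v;     /* may still be sitting dirty in the cache... */
    c->valid = 1;     /* ...while this store reaches NVM first */
}

uint64_t read_cell(const struct cell *c)  /* 0 if not yet published */
{
    return c->valid ? c->value : 0;
}
```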
Primitive operation: ordering writes
• Why?
• Ensures ability to commit a change
• How?
• Flush – MOVNTQ/CLFLUSH
• Fence – MFENCE
• Inefficiencies:
• Removes recent data from cache
[Figure: with STORE value = 0xC02; FLUSH (&value); FENCE; STORE valid = 1, the value reaches NVM before the flag, so the write-back cache can no longer reorder the commit]
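The corrected sequence on this slide maps directly onto x86 intrinsics; a minimal sketch (same illustrative `cell` structure, and note that on NVM-era CPUs CLWB or CLFLUSHOPT would be preferred over CLFLUSH, which also evicts the line):

```c
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush (SSE2) */
#include <xmmintrin.h>   /* _mm_sfence */

struct cell {
    uint64_t value;
    uint64_t valid;
};

/* Flush the value's cacheline and fence before setting the flag, so NVM
 * can never observe valid = 1 with a stale value. */
void safe_publish(struct cell *c, uint64_t v)
{
    c->value = v;
    _mm_clflush((void *)&c->value); /* force the value out of the cache */
    _mm_sfence();                   /* order the flush before the flag */
    c->valid = 1;
    /* a second flush + fence would make the flag itself durable too */
}
```

This is exactly the inefficiency the slide names: the freshly written value is kicked out of the cache even though it is likely to be read again soon.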
Ordering in BPFS
[Figure: a sequence of CoW writes followed by a commit write must drain from L1/L2 to BPRAM in order – the commit must not reach BPRAM before the CoW writes]

Atomicity in BPFS
[Figure: the commit itself is a single 8-byte write that BPRAM must apply atomically, even across a power failure]
Enforcing Ordering and Atomicity
• Ordering
• Solution: epoch barriers to declare constraints
• Faster than write-through
• Important hardware primitive (cf. SCSI TCQ)
• Atomicity
• Solution: capacitor on DIMM
• Simple and cheap!
Intel x86 flush mechanism
[Figure: timeline of ST A; CLWB A; SFENCE – the SFENCE commits only after the memory system acknowledges the flush of A]

ST A
ST B
CLWB A
CLWB B
SFENCE
ST C
CLWB C
SFENCE
Intel x86 flush mechanism
• Drawback 1: No distinction between ordering and durability
• Drawback 2: Ordering introduces stalls
ST A
ST B
CLWB A
CLWB B
SFENCE
ST C
CLWB C
SFENCE
Epoch ordering
• Goal:
• No software flushes – too expensive/complex
• Ordering is asynchronous – too expensive to stall
• Solution:
• Persist barriers
Persist barriers: Ordering Fence
[Figure: Thread 1 issues ST A=1, a barrier, then ST B=2; the happens-before edge in volatile memory order is preserved in persistence order]
• Orders stores preceding the barrier before later stores
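The persist-barrier programming model can be sketched as below. This is hypothetical: `epoch_barrier()` is not a real instruction (here it is only a compiler barrier standing in for the proposed hardware), and the structure is the same illustrative valid-flag cell. Stores before the barrier must persist before stores after it, but unlike CLWB+SFENCE the CPU neither flushes nor stalls; the cache tracks epoch tags and enforces the order lazily at eviction time.

```c
/* Hypothetical epoch-barrier primitive: on real epoch hardware this
 * would bump the core's epoch counter; here it is only a compiler
 * barrier placeholder so the sketch compiles and runs. */
static inline void epoch_barrier(void)
{
    __asm__ volatile ("" ::: "memory");
}

struct cell {
    unsigned long value;
    unsigned long valid;
};

void epoch_publish(struct cell *c, unsigned long v)
{
    c->value = v;     /* epoch N */
    epoch_barrier();  /* epoch N persists before epoch N+1 */
    c->valid = 1;     /* epoch N+1 */
}
```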
Ordering Epochs without Flushing
[Figure: CPU 1 keeps a local timestamp (25, then 26 after the barrier). Running ST A=1; ST B=1; LD R1=A; BARRIER; ST A=2 tags the L1 cachelines A=1 and B=1 with epoch 25, and A=2 with epoch 26]
Ordering and Atomicity with Epoch Barriers
[Figure: CoW writes tagged epoch 1 and a commit write tagged epoch 2 sit in L1/L2; the epoch-2 commit line is ineligible for eviction to BPRAM until all epoch-1 lines have drained]
Epoch ordering complexity
• When is it safe to let something leave the cache?
• When all writes from preceding epochs have left already
• What happens if you overwrite something from a preceding epoch?
• Must flush the earlier epoch first – can’t store multiple versions
• What happens when you access something from another core?
• Can’t track ordering across cores (epoch numbers across cores aren’t ordered)
• Old data must be flushed
• How do you implement it efficiently?
• Store an 8-bit pointer in each cacheline to registers holding 8 in-flight epochs
Considerations for epoch ordering
• How complex is it?
• How easy to use is it?
Considerations for epoch ordering
• How complex is it?
• Need hardware walkers to evict cachelines during cache replacement
• How easy to use is it?
• Dependencies across volatile variables are not recorded
• Example – two threads serialized by a volatile lock could reboot with Y=2, A=4:

Acquire(vol_lock);      Acquire(vol_lock);
X = 1;                  A = 4;
Y = 2;                  B = 5;
Release(vol_lock);      Release(vol_lock);
Microbenchmarks
[Figure: two charts comparing NTFS - Disk, NTFS - RAM, and BPFS - RAM. Left: random n-byte write throughput (thousands of ops) for n = 8, 64, 512, 4096. Right: time (s) to append n bytes for the same sizes. Annotations mark NTFS - RAM as NOT DURABLE, while NTFS - Disk and BPFS - RAM are DURABLE]
Notes from reviews
• How much performance improvement should we expect?
• How important is using real PCM (or real PCM latency) in evaluation?
• Could we have systems with just Pmem and no SSD?
• What journaling mode does NTFS use?
• Ordered journaling
• Is modifying HW ok?
• Using volatile structures
• Free blocks, freed & allocated inode numbers
• Data freed by CoW operations
• Dentry cache
How well does it perform?
• Evaluation:
• Implement in Windows & run over DRAM (no epoch barrier delays)
• Implement in usermode & run in a simulator
• Analytical model
• Workloads