File systems for persistent memory
CS 839 - Persistence
Questions on homework?
• Can we shift the schedule and do BPFS on Thursday and Nova on Monday? Drop Aerie or SplitFS.
Learning outcomes
• Understand how disk-based file systems update metadata and handle consistency
• Understand the properties of NVM that can change file system design
• Understand the key ordering requirements for file systems
• Understand BPFS software and hardware mechanisms and their limitations
Background story
• PCM is becoming popular, first for main memory
• Obvious approach seems to be use it for file systems too
• Question: how do you optimize?
Background: normal file systems
• Use page cache to buffer data in DRAM
• Access SSD through block layer
• Use logging for consistency
Background: FS data structures
• Standard FS data structures
• Superblock: describes FS parameters, location of root inode
• Inode: metadata for a single file
  • Attributes, size, location of data blocks
  • Number
• Data block: holds file or directory contents
• Directory entry: string name and inode number
• Inode and data block bitmaps: track free/used locations on storage
• Indirect block: locations of other data blocks or indirect blocks
Background: FS consistency
• What gets updated when appending to a file?
• Allocate block from data bitmap
• Write data to data block
• Write block address to inode or indirect block
• Update file length & modification time in inode
• What happens if system crashes in the middle?
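The append steps above can be sketched in C. This is a deliberately simplified, hypothetical FS (structure names and sizes are invented for illustration); each numbered step is a separate write to storage, and a crash between any two of them leaves the metadata inconsistent, e.g. an allocated block that no inode references.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified on-"disk" structures for illustration only. */
#define BLOCK_SIZE 4096
#define NBLOCKS    64
#define NDIRECT    12

struct inode {
    uint64_t size;            /* file length in bytes */
    uint64_t mtime;           /* modification time */
    uint32_t direct[NDIRECT]; /* data block addresses */
};

static uint8_t data_bitmap[NBLOCKS];      /* 1 = in use */
static uint8_t blocks[NBLOCKS][BLOCK_SIZE];

/* Append one block; a crash between any two steps is the consistency
 * problem journaling or shadow updates must solve. */
int append_block(struct inode *ino, const uint8_t *buf, uint64_t now)
{
    if (ino->size / BLOCK_SIZE >= NDIRECT)
        return -1;                         /* sketch: no indirect blocks */

    /* 1. Allocate a block from the data bitmap. */
    uint32_t b;
    for (b = 0; b < NBLOCKS && data_bitmap[b]; b++) ;
    if (b == NBLOCKS) return -1;
    data_bitmap[b] = 1;

    /* 2. Write the data into the new block. */
    memcpy(blocks[b], buf, BLOCK_SIZE);

    /* 3. Record the block address in the inode. */
    ino->direct[ino->size / BLOCK_SIZE] = b;

    /* 4. Update file length and modification time. */
    ino->size += BLOCK_SIZE;
    ino->mtime = now;
    return 0;
}
```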
Background: FS consistency mechanisms
• Journaling: write metadata (and/or data) to a journal before writing it in place – redo logging
• Write journal, force to storage
• Later checkpoint – write metadata/data in real place
• Can skip data journaling for performance
• Shadow updates: write data/metadata updates to a new location (used in BPFS)
• Basically copy-on-write data structures
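The redo-logging discipline above can be sketched as follows. This is an assumed, minimal model (the record format, `flush_to_storage` stand-in, and function names are not from any real FS): updates go to the journal and are forced durable, a commit record is written, and only then are they checkpointed in place. Replaying a committed journal is idempotent, so a crash during checkpoint is harmless.

```c
#include <stdint.h>

#define JSLOTS 16

struct jrec { uint64_t dest; uint64_t val; };  /* redo record: new value */

static struct jrec journal[JSLOTS];
static int      jhead;
static int      jcommitted;          /* set only after a forced flush */
static uint64_t storage[JSLOTS];     /* the "in-place" metadata locations */

static void flush_to_storage(void) { /* stand-in for a disk cache flush */ }

void journal_write(uint64_t dest, uint64_t val)
{
    journal[jhead].dest = dest;
    journal[jhead].val  = val;
    jhead++;
}

void journal_commit(void)
{
    flush_to_storage();              /* journal records durable first */
    jcommitted = 1;                  /* then the commit record */
    flush_to_storage();
}

/* Checkpoint: replay the journal into the real locations. After a crash,
 * an uncommitted journal is simply discarded. */
void journal_checkpoint(void)
{
    if (!jcommitted) return;
    for (int i = 0; i < jhead; i++)
        storage[journal[i].dest] = journal[i].val;
    flush_to_storage();
    jhead = 0;
    jcommitted = 0;                  /* journal space can be reused */
}
```

Note that every journaled byte is written twice (journal, then in place), which is exactly the overhead the "Review 1" slide calls out.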
Review 1: Journaling
• Write to journal, then write to file system
[Figure: blocks A and B are first written to the journal as A’ and B’, then checkpointed into the file system in place]
• Reliable, but all data is written twice
Review 2: Shadow Paging
• Use copy-on-write up to root of file system
[Figure: updating blocks A and B copies them to A’ and B’ and propagates new pointers all the way up to the file’s root pointer]
• Any change requires bubbling to the FS root
• Small writes require large copying overhead
Atomicity requirements
• What happens when you crash while writing data to a file?
1. Entire write takes place or none takes place
2. Some blocks may be written entirely but not all
3. Arbitrary bytes of file may be replaced
• What do normal file systems do?
• “torn write” – partially written block
• Data vs metadata journaling
Basic idea: RAM disk
• Idea 1: RAM disk
• Make a block device that accesses NVM instead of going to a device
• BTT: block translation table, uses shadow updates to allow atomic block-sized writes
• Problems:
• Still copies data to DRAM – inefficient
• All writes are block sized – inefficient
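The BTT idea can be sketched as below. This is a simplified model, not the real libnvdimm BTT layout (the names, table sizes, and 32-bit map entries are assumptions): each logical block reads through an indirection table, a write lands in a free physical block, and a single atomic map-entry update commits it, so a torn block write can never be observed.

```c
#include <stdint.h>
#include <string.h>

#define BSZ   4096
#define NLOG  4          /* logical blocks */
#define NPHYS 6          /* physical blocks; the extras form a free pool */

static uint8_t  phys[NPHYS][BSZ];
static uint32_t map[NLOG] = {0, 1, 2, 3};     /* logical -> physical */
static uint32_t free_list[NPHYS - NLOG] = {4, 5};
static int      nfree = 2;

/* Shadow update: write out of place, then commit with one atomic
 * map-entry store. A crash before the store leaves the old block mapped. */
void btt_write(uint32_t lba, const uint8_t *buf)
{
    uint32_t newp = free_list[--nfree];   /* pick a free physical block */
    memcpy(phys[newp], buf, BSZ);         /* write data out of place */
    /* ...flush + fence here on real hardware... */
    uint32_t oldp = map[lba];
    map[lba] = newp;                      /* atomic swap is the commit */
    free_list[nfree++] = oldp;            /* old block becomes free */
}

const uint8_t *btt_read(uint32_t lba) { return phys[map[lba]]; }
```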
What changes with NVM/SCM/PMem?
• Fine-grained writes
• Don’t have to write entire blocks when updating a single value
• Fast random access
• Don’t need to optimize metadata for sequential extents
• No buffering
• Can serve data directly from memory
• But:
• Loss of ordering
Short-Circuit Shadow Paging
• Uses byte-addressability and atomic 64-bit writes
[Figure: copy-on-write of blocks A and B stops partway up the tree; the update commits with one atomic pointer write instead of bubbling to the file’s root pointer]
• Inspired by shadow paging
– Optimization: in-place update when possible
Opt. 1: In-Place Writes
• Aligned 64-bit writes are performed in place
• Data and metadata
[Figure: an in-place write modifies a data block directly, with no pointer updates above it]
• Appends committed by updating file size
[Figure: an in-place append writes past end-of-file, then commits with an atomic update of the file’s root pointer + size]
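The in-place append can be sketched as below. This is a simplified model (BPFS actually commits through the file's root pointer + size packed into one word; here the commit is a lone 8-byte size field, and the flush/fence is elided): data past end-of-file is written in place first, invisible to readers, and a single atomic 64-bit store of the new size makes it visible.

```c
#include <stdint.h>
#include <string.h>

#define CAP 65536

/* Simplified persistent file: the 8-byte, aligned size field doubles as
 * the commit record for appends. */
struct pm_file {
    uint8_t data[CAP];
    volatile uint64_t size;   /* aligned: hardware updates it atomically */
};

void pm_append(struct pm_file *f, const uint8_t *buf, uint64_t len)
{
    memcpy((uint8_t *)&f->data[f->size], buf, len); /* past EOF: invisible */
    /* ...flush + fence so the data is durable before the commit... */
    f->size += len;           /* atomic 64-bit store is the commit point */
}
```

A crash before the final store leaves the old size, so readers never see a partially appended tail.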
Opt. 2: Exploit Data-Metadata Invariants
BPFS Example
[Figure: BPFS tree of inodes, indirect blocks, and file/directory blocks under the root pointer; adding or removing a directory entry commits with a single atomic pointer update]
• Cross-directory rename bubbles to common ancestor
What happens if you memory-map a file?
• Rely on hardware for 1-word atomic update
➢ CPU cache may reorder writes to NVM
• Breaks “crash-consistent” update protocols
Consistent updates
[Figure: the program issues STORE value = 0xC02, then STORE valid = 1. The write-back cache may evict the valid flag first, leaving NVM with valid = 1 but a stale value (0xDEADBEEF)]
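The hazard on this slide can be sketched as a tiny valid-flag publish protocol (names are illustrative). In program order the code is safe, but nothing orders the two stores' arrival in NVM: a write-back cache may evict `valid` before `value`, so after a crash NVM can hold `valid = 1` with the stale value.

```c
#include <stdint.h>

struct cell {
    uint64_t value;
    uint64_t valid;   /* 1 means value is published */
};

/* Correct in program order, but broken for crash consistency: the cache
 * may write `valid` back to NVM before `value`. */
void unsafe_publish(struct cell *c, uint64_t v)
{
    c->value = v;     /* may still be sitting dirty in the cache... */
    c->valid = 1;     /* ...while this store reaches NVM first */
}

uint64_t read_cell(const struct cell *c)  /* 0 if not yet published */
{
    return c->valid ? c->value : 0;
}
```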
Primitive operation: ordering writes
• Why?
• Ensures ability to commit a change
• How?
• Flush – MOVNTQ/CLFLUSH
• Fence – MFENCE
• Inefficiencies:
• Removes recent data from cache
[Figure: with STORE value = 0xC02; FLUSH (&value); FENCE; STORE valid = 1, the value reaches NVM before the flag, so the write-back cache can no longer reorder the commit]
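The corrected sequence on this slide maps directly onto x86 intrinsics; a minimal sketch (same illustrative `cell` structure, and note that on NVM-era CPUs CLWB or CLFLUSHOPT would be preferred over CLFLUSH, which also evicts the line):

```c
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush (SSE2) */
#include <xmmintrin.h>   /* _mm_sfence */

struct cell {
    uint64_t value;
    uint64_t valid;
};

/* Flush the value's cacheline and fence before setting the flag, so NVM
 * can never observe valid = 1 with a stale value. */
void safe_publish(struct cell *c, uint64_t v)
{
    c->value = v;
    _mm_clflush((void *)&c->value); /* force the value out of the cache */
    _mm_sfence();                   /* order the flush before the flag */
    c->valid = 1;
    /* a second flush + fence would make the flag itself durable too */
}
```

This is exactly the inefficiency the slide names: the freshly written value is kicked out of the cache even though it is likely to be read again soon.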
Ordering in BPFS
[Figure: a sequence of CoW writes followed by a commit write must drain from L1/L2 to BPRAM in order – the commit must not reach BPRAM before the CoW writes]

Atomicity in BPFS
[Figure: the commit itself is a single 8-byte write that BPRAM must apply atomically, even across a power failure]
Enforcing Ordering and Atomicity
• Ordering
• Solution: epoch barriers to declare constraints
• Faster than write-through
• Important hardware primitive (cf. SCSI TCQ)
• Atomicity
• Solution: capacitor on DIMM
• Simple and cheap!
Intel x86 flush mechanism
[Figure: timeline of ST A; CLWB A; SFENCE – the SFENCE commits only after the memory system acknowledges the flush of A]

ST A
ST B
CLWB A
CLWB B
SFENCE
ST C
CLWB C
SFENCE
Intel x86 flush mechanism
• Drawback 1: No distinction between ordering and durability
• Drawback 2: Ordering introduces stalls
ST A
ST B
CLWB A
CLWB B
SFENCE
ST C
CLWB C
SFENCE
Epoch ordering
• Goal:
• No software flushes – too expensive/complex
• Ordering is asynchronous – too expensive to stall
• Solution:
• Persist barriers
Persist barriers: Ordering Fence
[Figure: Thread 1 issues ST A=1, a barrier, then ST B=2; the happens-before edge in volatile memory order is preserved in persistence order]
• Orders stores preceding the barrier before later stores
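The persist-barrier programming model can be sketched as below. This is hypothetical: `epoch_barrier()` is not a real instruction (here it is only a compiler barrier standing in for the proposed hardware), and the structure is the same illustrative valid-flag cell. Stores before the barrier must persist before stores after it, but unlike CLWB+SFENCE the CPU neither flushes nor stalls; the cache tracks epoch tags and enforces the order lazily at eviction time.

```c
/* Hypothetical epoch-barrier primitive: on real epoch hardware this
 * would bump the core's epoch counter; here it is only a compiler
 * barrier placeholder so the sketch compiles and runs. */
static inline void epoch_barrier(void)
{
    __asm__ volatile ("" ::: "memory");
}

struct cell {
    unsigned long value;
    unsigned long valid;
};

void epoch_publish(struct cell *c, unsigned long v)
{
    c->value = v;     /* epoch N */
    epoch_barrier();  /* epoch N persists before epoch N+1 */
    c->valid = 1;     /* epoch N+1 */
}
```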
Ordering Epochs without Flushing
[Figure: CPU 1 keeps a local timestamp (25, then 26 after the barrier). Running ST A=1; ST B=1; LD R1=A; BARRIER; ST A=2 tags the L1 cachelines A=1 and B=1 with epoch 25, and A=2 with epoch 26]
Ordering and Atomicity with Epoch Barriers
[Figure: CoW writes tagged epoch 1 and a commit write tagged epoch 2 sit in L1/L2; the epoch-2 commit line is ineligible for eviction to BPRAM until all epoch-1 lines have drained]
Epoch ordering complexity
• When is it safe to let something leave the cache?
• When all writes from preceding epochs have left already
• What happens if you overwrite something from a preceding epoch?
• Must flush the earlier epoch first – can’t store multiple versions
• What happens when you access something from another core?
• Can’t track ordering across cores (epoch numbers across cores aren’t ordered)
• Old data must be flushed
• How do you implement it efficiently?
• Store an 8-bit pointer in each cacheline to registers holding 8 in-flight epochs
Considerations for epoch ordering
• How complex is it?
• How easy to use is it?
Considerations for epoch ordering
• How complex is it?
• Need hardware walkers to evict cachelines during cache replacement
• How easy to use is it?
• Dependencies across volatile variables are not recorded
• Example – two threads serialized by a volatile lock could reboot with Y=2, A=4:

Acquire(vol_lock);      Acquire(vol_lock);
X = 1;                  A = 4;
Y = 2;                  B = 5;
Release(vol_lock);      Release(vol_lock);
Microbenchmarks
[Figure: two charts comparing NTFS - Disk, NTFS - RAM, and BPFS - RAM. Left: random n-byte write throughput (thousands of ops) for n = 8, 64, 512, 4096. Right: time (s) to append n bytes for the same sizes. Annotations mark NTFS - RAM as NOT DURABLE, while NTFS - Disk and BPFS - RAM are DURABLE]
Notes from reviews
• How much performance improvement should we expect?
• How important is using real PCM (or real PCM latency) in evaluation?
• Could we have systems with just Pmem and no SSD?
• What journaling mode does NTFS use?
• Ordered journaling
• Is modifying HW ok?
• Using volatile structures
• Free blocks, freed & allocated inode numbers
• Data freed by CoW operations
• Dentry cache
How well does it perform?
• Evaluation:
• Implement in Windows & run over DRAM (no epoch barrier delays)
• Implement in usermode & run in a simulator
• Analytical model
• Workloads