HMVFS: A Hybrid Memory Versioning File System
Shengan Zheng, Linpeng Huang, Hao Liu, Linzhu Wu, Jin Zha
Department of Computer Science and Engineering
Shanghai Jiao Tong University
Introduction
• Emerging Non-Volatile Memory (NVM)
• Persistent like disk
• Byte-addressable like DRAM
• Current file systems for NVM
• PMFS, SCMFS, BPFS
• Non-versioning, unable to recover old data
• Hardware and software errors
• Large datasets and long execution times
• A fault tolerance mechanism is needed
• Current versioning file systems
• BTRFS, NILFS2
• Not optimized for NVM
Design Goals
• Strong consistency
• A Stratified File System Tree (SFST) represents a snapshot of the whole file system
• Atomic snapshotting is ensured
• Fast recovery
• Almost no redo or undo overhead in recovery
• High performance
• Utilize the byte-addressability of NVM to update the tree metadata at the granularity of bytes
• Log-structured updates to files balance the endurance of NVM
• Avoid write amplification
• User friendly
• Snapshots are created automatically and transparently
Overview
• HMVFS is an NVM-friendly log-structured versioning file system
• Space-efficient file system snapshotting
• HMVFS decouples tree metadata from tree data
• High performance and consistency guarantee
• POSIX compliant
On-Memory Layout
• DRAM: cache and journal
• NVM:
• Sequential write zone
• File metadata and data
• Tree data
• Random write zone
• File system metadata
• Tree metadata
[Figure: on-memory layout. DRAM holds the Node Address Tree Cache (NAT Cache), the Checkpoint Information Tree (CIT), and the Segment Information Table Journal. NVM holds the Superblock, the auxiliary information zone updated with random writes (Segment Information Table (SIT), Block Information Table (BIT), Node Address Tree (NAT)), and the main area (SFST) updated with sequential writes (checkpoint blocks (CP), node blocks, data blocks).]
Index Structure in traditional Log-structured File Systems
• Update propagation problem: a log-structured update to one data block forces new copies of all its ancestor index blocks (the wandering tree problem)
[Figure: traditional inode index structure. The inode block holds metadata, direct pointers or inline data, and single-, double-, and triple-indirect pointers; indirect nodes point to direct nodes, which point to data blocks. Updating a single data block marks the data block, its direct node, its indirect node, and the inode as updated blocks.]
Index Structure without write amplification
• A Node Address Table maps node IDs to block addresses; parents reference children by node ID, so an update no longer propagates upward (a lookup sketch follows the table)
[Figure: the same inode index structure, now with node blocks reached through the Node Address Table. Updating a data block only updates the data block, its direct node, and the corresponding table entry.]

Node Address Table
Node-ID   Address
…         …
n-1       0x38
n         0x42 → 0x73 (updated when node n is rewritten)
n+1       0x24
…         …
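To make the mechanism concrete, here is a minimal sketch of a NAT lookup in C. All names (nat_entry, nat_lookup, nid_t, blkaddr_t) are illustrative assumptions, not HMVFS's actual symbols, and a real table would be indexed rather than scanned:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t nid_t;     /* node ID: the stable name of a node block */
typedef uint64_t blkaddr_t; /* NVM block address */

struct nat_entry {
    nid_t     nid;  /* node ID */
    blkaddr_t addr; /* current address of that node block */
};

/* Parents store nid, not addr, so rewriting a node block only changes
 * its NAT entry; all ancestor blocks stay untouched. */
blkaddr_t nat_lookup(const struct nat_entry *table, size_t n, nid_t nid)
{
    for (size_t i = 0; i < n; i++)
        if (table[i].nid == nid)
            return table[i].addr;
    return 0; /* invalid address: node not found */
}
```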
Index Structure for versioning
• The Node Address Table gains a version dimension: each snapshot keeps its own node-ID-to-address mapping (a lookup sketch follows the table)
[Figure: the same index structure; each version of the file system keeps its own Node Address Table.]

Node Address Table with Version
Node-ID   Version 1   Version 2   Version 3
…         …           …           …
n-1       0x14        0x38        0x38
n         0x42        0x42        0x73
n+1       0x24        0x24        0x24
…         …           …           …
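A hedged sketch of the versioned lookup, reusing the types from the previous sketch; the array-of-tables layout is an illustrative assumption, since keeping a full table copy per version would waste space (the NAT B-tree on the next slide shares unchanged structure instead):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t nid_t;
typedef uint64_t blkaddr_t;
struct nat_entry { nid_t nid; blkaddr_t addr; };

blkaddr_t nat_lookup(const struct nat_entry *t, size_t n, nid_t nid); /* above */

/* One Node Address Table per version: lookups take a version index. */
blkaddr_t nat_lookup_version(const struct nat_entry *const tables[],
                             size_t nents, uint32_t version, nid_t nid)
{
    return nat_lookup(tables[version], nents, nid);
}
```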
How to store different trees space-efficiently
• Node Address Tree (NAT)
• A four-level B-tree that stores the multi-version Node Address Table space-efficiently
• Adopts the idea of the CoW friendly B-tree (a cloning sketch follows the figure)
• NAT leaves contain NodeID-address pairs
• Other tree blocks in the NAT contain pointers to lower-level blocks
[Figure: Node Address Tree. The NAT root points to NAT internal blocks, which point to NAT leaves, which point to node blocks (inodes, indirect and direct nodes); the original and new snapshots share unchanged subtrees.]

[Figure: lazy reference counting in a CoW friendly B-tree. Original tree: P(1) with A(1), B(1), C(1), D(1), and E(1), F(1) below. Cloning P into Q copies only the modified path (D' and F', each at count 1) and raises the shared blocks C and E to 2. Once the original snapshot is deleted, P, D, and F drop to 0 and become garbage, while C and E return to 1.]
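The lazy reference counting idea can be sketched as follows. struct bt_node and cow_clone are hypothetical names, and this in-DRAM sketch ignores persistence; the key point is that cloning a node bumps only its direct children's counts, leaving all deeper descendants untouched until they are themselves copied or deleted:

```c
#include <stdlib.h>
#include <string.h>

#define FANOUT 4

struct bt_node {
    int             refcnt;       /* number of parents linking here */
    int             nkids;
    struct bt_node *kid[FANOUT];
};

/* Copy-on-write one node: the copy shares the original's children, so
 * each child gains one parent and its refcount is bumped by one.
 * Grandchildren are NOT touched; that is the "lazy" part. */
struct bt_node *cow_clone(const struct bt_node *orig)
{
    struct bt_node *copy = malloc(sizeof(*copy));
    memcpy(copy, orig, sizeof(*copy));
    copy->refcnt = 1;                 /* only the new parent links here */
    for (int i = 0; i < copy->nkids; i++)
        copy->kid[i]->refcnt++;       /* child now shared by both trees */
    return copy;
}
```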
Stratified File System Tree (SFST)
• Four different categories of blocks (a type-tag sketch follows the figure):
• Checkpoint layer
• Node Address Tree (NAT) layer
• Node layer
• Data layer
• All blocks of an SFST are stored in the main area with log-structured writes
• Balances the endurance of the NVM media
• Each SFST represents a valid snapshot of the file system
• Snapshots share overlapping blocks for space efficiency
[Figure: two SFSTs, the original and the new snapshot. Each checkpoint block (CP) roots a Node Address Tree (NAT root, internals, leaves); NAT leaves point to node blocks (inodes, indirect and direct nodes), which point to data blocks. Unchanged subtrees are shared between snapshots.]
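The four layers map naturally onto a per-block type tag, which the BIT records (see below). The enum is a hypothetical rendering for the sketches that follow, not HMVFS's actual definition:

```c
/* Hypothetical block-type tag reflecting the four SFST layers. */
enum sfst_layer {
    SFST_CHECKPOINT, /* checkpoint layer: roots one snapshot          */
    SFST_NAT,        /* NAT layer: node-ID to address mapping         */
    SFST_NODE,       /* node layer: inodes, indirect and direct nodes */
    SFST_DATA,       /* data layer: file contents                     */
};
```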
Stratified File System Tree (SFST)
• The metadata of the SFST
• Stored in the auxiliary information zone
• Updated in place with random writes
• Segment Information Table (SIT)
• Contains the status information of every segment
• Block Information Table (BIT)
• Keeps the information of every block
• Updated precisely at byte granularity
• Each BIT entry contains (sketched below):
• Start and end version numbers
• Block type
• Node ID
• Reference count
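A sketch of what a BIT entry could look like given the fields listed above; field names and widths are assumptions, not the on-NVM layout:

```c
#include <stdint.h>

struct bit_entry {
    uint32_t start_ver; /* first version in which the block is valid */
    uint32_t end_ver;   /* last version in which the block is valid  */
    uint8_t  type;      /* enum sfst_layer: checkpoint/NAT/node/data */
    uint32_t nid;       /* node ID used to locate the parent block   */
    uint32_t refcount;  /* number of parent blocks linking here      */
};
```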
Garbage Collection in HMVFS
• Move all the valid blocks in the victim segment to the current segment
• When finished, update SIT and create a snapshot
• Handles the block sharing problem (see the sketch after the figure)
[Figure: garbage collection with shared blocks. In victim Segment A, Node Block 1 is valid only in Version 1, while Node Block 2 is shared by the NAT blocks of Versions 1 through 4; moving Node Block 2 to Segment B requires updating the pointer in every sharing parent.]
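A high-level sketch of the cleaning loop; every helper (is_valid, migrate_block, update_parents, update_sit, create_snapshot) is an assumed name standing in for the steps on the slide, not actual HMVFS code:

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCKS_PER_SEG 512           /* illustrative segment size */
typedef uint64_t blkaddr_t;

struct block;                        /* opaque here */
struct segment { struct block *blocks[BLOCKS_PER_SEG]; };

/* Assumed helpers: */
bool      is_valid(struct block *b);                        /* refcount > 0? */
blkaddr_t migrate_block(struct block *b, struct segment *dst);
void      update_parents(struct block *b, blkaddr_t new_addr);
void      update_sit(struct segment *victim, struct segment *dst);
void      create_snapshot(void);

/* Clean one victim segment: copy live blocks out, fix every sharing
 * parent's pointer, then persist the new state with a snapshot. */
void gc_segment(struct segment *victim, struct segment *current)
{
    for (int i = 0; i < BLOCKS_PER_SEG; i++) {
        struct block *blk = victim->blocks[i];
        if (!blk || !is_valid(blk))
            continue;                        /* garbage: skip */
        blkaddr_t new_addr = migrate_block(blk, current);
        update_parents(blk, new_addr);       /* all versions sharing blk */
    }
    update_sit(victim, current);
    create_snapshot();                       /* persist the post-GC state */
}
```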
Block Information Table (BIT)
• Block sharing problem
• The corresponding pointer in the parent block must be updated if a new child block is written in the main area
• Node ID and block type
• Used together to locate the parent block (see the sketch after the table)
Type of the block    Type of the parent    Node ID
Checkpoint           N/A                   N/A
NAT internal         NAT internal          Index code in NAT
NAT leaf             NAT internal          Index code in NAT
Inode                NAT leaf              Node ID
Indirect             NAT leaf              Node ID
Direct               NAT leaf              Node ID
Data                 Inode or direct       Node ID of parent node
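The table translates into a straightforward dispatch. locate_parent and the helpers below are hypothetical names, and SFST_NODE collapses the inode/indirect/direct rows since they share a parent type:

```c
#include <stdint.h>

typedef uint64_t blkaddr_t;
enum sfst_layer { SFST_CHECKPOINT, SFST_NAT, SFST_NODE, SFST_DATA };

/* Assumed helpers resolving addresses within one snapshot's NAT: */
blkaddr_t nat_internal_addr(uint32_t version, uint32_t index_code);
blkaddr_t nat_leaf_addr(uint32_t version, uint32_t nid);
blkaddr_t node_addr(uint32_t version, uint32_t nid); /* NAT lookup of a node */

/* Find the block holding the pointer to a given block, using only the
 * type and node ID stored in its BIT entry. */
blkaddr_t locate_parent(enum sfst_layer type, uint32_t nid, uint32_t version)
{
    switch (type) {
    case SFST_CHECKPOINT:
        return 0;                                /* roots have no parent */
    case SFST_NAT:
        return nat_internal_addr(version, nid);  /* nid = index code     */
    case SFST_NODE:
        return nat_leaf_addr(version, nid);      /* parent is a NAT leaf */
    case SFST_DATA:
        return node_addr(version, nid);   /* parent inode or direct node */
    }
    return 0;
}
```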
Block Information Table (BIT)
• Start and end version numbers
• The first and the last version in which the block is valid
• Operations like write and delete set these two fields to the current version number
• Reference count
• The number of parent nodes that link to the block
• Updated with lazy reference counting
• Both file-level and snapshot-level operations update the reference count
• If the count reaches zero, the block becomes garbage (a validity check is sketched below)
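Putting the fields together, a block's liveness in a given version reduces to a range check plus a nonzero reference count. A minimal sketch, mirroring the bit_entry layout assumed earlier:

```c
#include <stdbool.h>
#include <stdint.h>

struct bit_entry { uint32_t start_ver, end_ver, nid, refcount; uint8_t type; };

/* Valid in version v iff v lies in [start_ver, end_ver] and at least
 * one parent still references the block. */
bool block_valid_in(const struct bit_entry *e, uint32_t v)
{
    return e->refcount > 0 && e->start_ver <= v && v <= e->end_ver;
}
```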
Snapshot Creation
• Strong consistency is guaranteed
• Dirty NAT entries are flushed from DRAM to form a new Node Address Tree
• Follows a bottom-up procedure
• Status information is stored in the checkpoint block
• Snapshots are space-efficient
• The atomicity of snapshot creation is ensured
• An atomic update to the pointer in the superblock announces the validity of the new snapshot (a commit sketch follows the figure)
• A crash during snapshot creation is recovered by undo or redo, depending on the validity of the new snapshot
[Figure: snapshot creation. The new Node Address Tree is built bottom-up from shared and newly written blocks (data, node, NAT leaf/internal/root, checkpoint block); finally the superblock pointer is switched atomically from the original snapshot's checkpoint block to the new one.]
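The commit itself can be sketched as one 8-byte store followed by a flush and a fence. pm_flush and pm_fence stand in for whatever persistence primitives the platform provides (e.g. clwb plus sfence), and the field name latest_cp is an assumption:

```c
#include <stddef.h>
#include <stdint.h>

void pm_flush(const void *addr, size_t len);  /* assumed: cacheline flush */
void pm_fence(void);                          /* assumed: store fence     */

struct superblock { uint64_t latest_cp; /* address of newest valid checkpoint */ };

/* The new SFST (data, node, NAT, and checkpoint blocks) is already
 * written and flushed bottom-up; a single atomic 8-byte store then
 * makes it the valid snapshot. A crash leaves the pointer at either
 * the old or the new checkpoint, never in between. */
void commit_snapshot(struct superblock *sb, uint64_t new_cp_addr)
{
    sb->latest_cp = new_cp_addr;
    pm_flush(&sb->latest_cp, sizeof sb->latest_cp);
    pm_fence();
}
```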
Snapshot Deletion
• Deletion starts from the checkpoint block
• The checkpoint cache is stored in DRAM
• Follows a top-down procedure to decrease reference counts (sketched after the figure)
• Consistency is ensured by journaling
• Garbage collection is invoked afterwards
• Many reference counts will have dropped to zero
[Figure: deleting the original snapshot decrements reference counts top-down. Before: P(1), A(1), B(1), C(2), D(1), E(2), F(1), plus the new snapshot Q(1) with D'(1), F'(1). After: P, D, and F drop to 0 and become garbage; the shared blocks C and E return to 1; Q, D', and F' are unaffected.]
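A sketch of the top-down decrement, reusing the bt_node shape from the earlier cloning sketch; recursion stops at blocks still shared by other snapshots, mirroring lazy reference counting:

```c
struct bt_node { int refcnt, nkids; struct bt_node *kid[4]; };

/* Decrement this block's count; only if it becomes exclusive garbage
 * do its children lose a parent, so recursion stops at shared blocks. */
void drop_tree(struct bt_node *n)
{
    if (--n->refcnt > 0)
        return;               /* still referenced by another snapshot */
    for (int i = 0; i < n->nkids; i++)
        drop_tree(n->kid[i]); /* recurse only below exclusive blocks  */
    /* n->refcnt == 0: the block is garbage; GC will reclaim it */
}
```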
Crash Recovery
• Mount the last completed snapshot as writable
• No additional recovery overhead
• Mount older snapshots as read-only
• Locate the checkpoint block of the snapshot
• Retrieve files via its SFST (a mount sketch follows the figure)
[Figure: crash recovery. The superblock links to the chain of checkpoint blocks; each checkpoint roots its own NAT, so any snapshot can be mounted by starting from its checkpoint block.]
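Mounting then reduces to locating a checkpoint and attaching its SFST root; find_checkpoint and attach_root are assumed names for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

struct superblock;
struct checkpoint;

struct checkpoint *find_checkpoint(struct superblock *sb, uint32_t version);
int attach_root(struct checkpoint *cp, bool writable);

int mount_snapshot(struct superblock *sb, uint32_t version, bool writable)
{
    struct checkpoint *cp = find_checkpoint(sb, version);
    if (!cp)
        return -1;                 /* snapshot does not exist */
    /* Only the last completed snapshot is mounted writable; older
     * snapshots stay read-only so shared blocks are never modified. */
    return attach_root(cp, writable);
}
```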
Evaluation
• Experimental Setup
• A commodity server with 64 Intel Xeon 2GHz processors and 512GB DRAM
• Performance comparison with PMFS, EXT4, BTRFS, NILFS2
• Postmark results
• With different read bias settings (percentage of reads)
[Charts: Postmark transaction performance (seconds vs. percentage of reads, 0% to 100%; HMVFS, BTRFS, NILFS2, EXT4, PMFS) and snapshotting efficiency (sec^-1 vs. percentage of reads; HMVFS, BTRFS, NILFS2). HMVFS improves snapshotting efficiency by 2.7x and 2.3x over BTRFS and NILFS2.]
Evaluation
• Filebench results
• Fileserver
• With different numbers of files
[Charts: fileserver throughput (ops/sec x1000 vs. number of files, 2k to 16k; HMVFS, BTRFS, NILFS2, EXT4, PMFS) and snapshotting efficiency (sec^-1; HMVFS, BTRFS, NILFS2). HMVFS improves snapshotting efficiency by 9.7x and 6.6x over BTRFS and NILFS2.]
Evaluation
• Filebench results
• Varmail
• With different directory depths
[Charts: varmail snapshotting efficiency (sec^-1 vs. directory depth, 0.7 to 2.1; HMVFS, BTRFS, NILFS2) and throughput (ops/sec x1000; HMVFS, BTRFS, NILFS2, EXT4, PMFS). HMVFS improves snapshotting efficiency by 8.7x and 2.5x over BTRFS and NILFS2.]
Conclusion
• HMVFS is the first file system to solve the consistency problem for NVM-based in-memory file systems using snapshotting
• The metadata of the Stratified File System Tree (SFST) is decoupled from data and is updated at byte granularity
• HMVFS stores snapshots space-efficiently with shared blocks in the SFST and handles the write amplification and block sharing problems well
• HMVFS exploits the structural benefit of the CoW friendly B-tree and the byte-addressability of NVM to take frequent snapshots automatically
• HMVFS outperforms traditional versioning file systems in snapshotting and overall performance, while providing a strong consistency guarantee and having little impact on foreground operations