Solaris Data Technologies
The Zettabyte Filesystem
Ulrich Gräf
Consultant, OS Ambassador
Core Technology Team
Office Frankfurt, Sun Microsystems
The Perfect Filesystem
Write my data
Keep it safe
Read it back
Do it fast
Don't hassle me
Existing Filesystems
Lots of limits: size, number of files, etc.
No data integrity checks
No defense against silent data corruption
No data security: spying, tampering, theft
Numerous performance and scaling problems
Excruciating to manage
The Next Generation Filesystem Objective
End the suffering!
The ZFS Filesystem
Write my data!
Immense capacity (128-bit)
Moore's Law: need 65th bit in 12 years
Zettabyte = 70-bit (a billion TB)
ZFS capacity: 256 quadrillion ZB
Quantum limit of Earth-based storage
Dynamic metadata
No limits on files, directory entries, etc.
No wacky knobs (e.g. inodes/cg)
The ZFS Filesystem
Keep it safe!
Provable data integrity model
Complete end-to-end verification
Detects bit rot, phantom writes, misdirections, common administrative errors
Self-healing data
Disk scrubbing
Real-time remote replication
Data authentication and encryption
The ZFS Filesystem
Do it fast!
Write sequentialization
Dynamic striping across all disks
Multiple block sizes
Constant-time snapshots
Concurrent, constant-time directory ops
Byte-range locking for concurrent writes
The ZFS Filesystem
Don't hassle me!
Pooled storage – no more volumes!
Filesystems are cheap and easy to create
Grow and shrink are automatic
No raw device names to remember
No more fsck(1M) ever
No more editing /etc/vfstab
Unlimited snapshots and user undo
All administration online
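The snapshot and user-undo bullets above correspond to a handful of commands in the ZFS CLI as it shipped; a sketch, assuming a pool named "home" with a filesystem per user (names are illustrative):

```shell
# Take a constant-time snapshot of Ann's filesystem
zfs snapshot home/ann@monday

# Ann can recover files herself from the read-only snapshot directory...
ls /export/home/ann/.zfs/snapshot/monday/

# ...or roll the whole filesystem back to the snapshot
zfs rollback home/ann@monday
```

Snapshots consume no space up front; they only retain blocks as the live filesystem diverges, which is what makes "unlimited snapshots" practical.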
Volumes vs. Pooled Storage
Traditional volumes
Partition per filesystem; painful to manage
Block-based FS/Volume interface slow, brittle
Pooled Storage
Filesystems share space; easy to manage
Transactional ZPL/DMU interface fast, robust
[Diagram: traditional model pairs each FS with its own volume (FS / Volume, three times); in ZFS, multiple filesystems (ZPL) share one Storage Pool (DMU/SPA)]
POSIX isn't the only game in town
DMU provides transactional object storage model
Filesystems
Databases
Raw volume emulation
What else might we do?
Object-Based Storage
[Diagram: ZPL, NFS, Oracle, UFS, other filesystems, and raw volume (Zvol) consumers all layered on the DMU, which runs on the SPA]
UFS/SVM Administration, page 1
Set up partition tables for each disk
Format one disk at a time
Each partition must be the maximum expected size (how do we know?) of the user's filesystem
The partitions on each disk must match
Also, we must set aside partitions for SVM itself to store its configuration database.
Create the SVM metadb
# format
... (long interactive session omitted)
# metadb -a -f disk1:slice0 disk2:slice0
UFS/SVM Administration, page 2
Create Ann's volume (d20)
Volume names must be of the form dNNN
We can't name Ann's volume “ann” (fixed in another project)
# metainit d10 1 1 disk1:slice1
d10: Concat/Stripe is setup
# metainit d11 1 1 disk2:slice1
d11: Concat/Stripe is setup
# metainit d20 -m d10
d20: Mirror is setup
# metattach d20 d11
d20: submirror d11 is attached
UFS/SVM Administration, page 3
Create Bob's volume (d21)
# metainit d12 1 1 disk1:slice2
d12: Concat/Stripe is setup
# metainit d13 1 1 disk2:slice2
d13: Concat/Stripe is setup
# metainit d21 -m d12
d21: Mirror is setup
# metattach d21 d13
d21: submirror d13 is attached
UFS/SVM Administration, page 4
Create Sue's volume (d22)
# metainit d14 1 1 disk1:slice3
d14: Concat/Stripe is setup
# metainit d15 1 1 disk2:slice3
d15: Concat/Stripe is setup
# metainit d22 -m d14
d22: Mirror is setup
# metattach d22 d15
d22: submirror d15 is attached
UFS/SVM Administration, page 5
Create and mount Ann's filesystem
Like SVM, UFS can't name Ann's filesystem “ann”
It's on volume d20, so its name is /dev/md/dsk/d20
Manually add it to /etc/vfstab
# newfs /dev/md/rdsk/d20
newfs: construct a new file system /dev/md/rdsk/d20: (y/n)? y
... (many pages of 'superblock backup' output omitted)
# mount /dev/md/dsk/d20 /export/home/ann
# vi /etc/vfstab
... While in 'vi', type this exactly:
/dev/md/dsk/d20 /dev/md/rdsk/d20 /export/home/ann ufs 2 yes -
UFS/SVM Administration, page 6
Now do the same for Bob and Sue
# newfs /dev/md/rdsk/d21
newfs: construct a new file system /dev/md/rdsk/d21: (y/n)? y
... (many pages of 'superblock backup' output omitted)
# mount /dev/md/dsk/d21 /export/home/bob
# vi /etc/vfstab
... While in 'vi', type this exactly:
/dev/md/dsk/d21 /dev/md/rdsk/d21 /export/home/bob ufs 2 yes -
# newfs /dev/md/rdsk/d22
newfs: construct a new file system /dev/md/rdsk/d22: (y/n)? y
... (many pages of 'superblock backup' output omitted)
# mount /dev/md/dsk/d22 /export/home/sue
# vi /etc/vfstab
... While in 'vi', type this exactly:
/dev/md/dsk/d22 /dev/md/rdsk/d22 /export/home/sue ufs 2 yes -
UFS/SVM Administration, page 7
Later, add more space for Bob
Note: this space isn't available to Ann or Sue
# format
... (long interactive session omitted)
# metattach d12 disk3:slice1
d12: component is attached
# metattach d13 disk4:slice1
d13: component is attached
# metattach d21
# growfs -M /export/home/bob /dev/md/rdsk/d21
/dev/md/rdsk/d21:
... (many pages of 'superblock backup' output omitted)
ZFS Administration
Create a storage pool named “home”, mounted at /export/home
# zpool create -m /export/home home mirror disk1 disk2
Create filesystems “ann”, “bob”, “sue” (mountpoints are inherited, and each filesystem is mounted automatically)
# zfs create home/ann
# zfs create home/bob
# zfs create home/sue
Later, add space to the “home” pool
# zpool add home mirror disk3 disk4
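After the two commands above there is nothing left to format, mount, or edit; the usual follow-up is just to look at the result. A sketch of the inspection commands (pool name as above):

```shell
# One line per pool: size, usage, health
zpool list home

# Device-level layout and error counters for the mirror
zpool status home

# All filesystems draw from the same free space -- no per-user sizing
zfs list -r home
```

Contrast with the preceding seven pages of format/metainit/newfs/vfstab work for the same three users.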
Provable Data Integrity Model
Three Big Rules
All operations are copy-on-write
Never overwrite live data
On-disk state always valid
No need for fsck(1M)
All operations are transactional
Related changes succeed or fail as a whole
No need for journaling
All data is checksummed
No silent data corruption
No panics on bad metadata
Traditional Checksums
Checksums stored with data blocks
Fine for detecting bit rot, but:
Can't detect phantom writes, misdirections
Can't validate the checksum itself
Can't authenticate the data
Can't detect common administrative errors
[Diagram: four data blocks, each storing its checksum alongside the data it covers]
ZFS Checksums
Checksums stored with indirect blocks
Self-validating, self-authenticating checksum tree
Detects phantom writes, misdirections, common administrative errors (e.g. swap on active ZFS disk)
[Diagram: checksum tree – each indirect block holds the address and checksum of the blocks beneath it, down to the data blocks, so every checksum is validated by its parent]
Self-Healing Data
Media error under traditional FS:
Bad user data causes silent data corruption
Bad metadata causes SDC, panic, or both
Media error under ZFS:
Checksum detects data corruption
SPA gets valid data from another replica
SPA repairs the damaged replica
ZFS returns valid data to application
No human intervention required
A non-event
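The disk-scrubbing and self-healing behavior described above is driven from the same CLI; a sketch, assuming a mirrored pool named "home":

```shell
# Walk every allocated block in the pool and verify its checksum;
# on a mirror, any block that fails is rewritten from the good replica
zpool scrub home

# Per-device read/write/checksum error counters show what was found and healed
zpool status -v home
```

Because repair happens inline on every read as well, a scrub is preventive maintenance rather than recovery: the application never sees the bad copy.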
ZFS S10 Content
Simple Administration
Pooled storage
Unlimited snapshots
NT-style ACLs and extended attributes
Quotas & reservations
Least privilege – lets end users do more
Online everything
Automatic grow/shrink
Dynamic metadata – no wacky knobs
Hot space
Host-neutral on-disk format (import/export)
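The quotas-and-reservations item deserves a concrete sketch, since it replaces the fixed partition sizing that made the UFS/SVM walkthrough painful (filesystem names as in the earlier example):

```shell
# Cap Bob at 10 GB without carving him a partition
zfs set quota=10g home/bob

# Guarantee Ann at least 5 GB even if the rest of the pool fills up
zfs set reservation=5g home/ann

# Both are plain properties, changeable online at any time
zfs get quota,reservation home/bob home/ann
```

Everything else in the pool stays shared: a quota is a ceiling and a reservation a floor, not a pre-allocated slice of disk.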
ZFS S10 Content
Provable end-to-end data integrity
64-bit self-validating checksums on every block
Historically considered “too expensive”
Turns out, no it isn't
And the alternative is unacceptable
Always-consistent on-disk format
Never needs fsck
Doesn't depend on journaling
Self-healing data in mirrored configurations
ZFS S10 Content
Smokin' performance
Already faster than UFS on most benchmarks
Under 1 second file system creation
Dynamic striping – automatically maximizes bandwidth
Intrinsically hot-spot free due to COW model
Constant-time directory operations (create/delete)
Multiple block sizes
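"Multiple block sizes" is exposed as a per-filesystem property in the shipped CLI; a sketch, with a hypothetical filesystem home/db:

```shell
# Match the maximum block ("record") size to a database's 8 KB I/O size,
# avoiding read-modify-write on small updates
zfs set recordsize=8k home/db
zfs get recordsize home/db
```

Filesystems holding ordinary files keep the default, so large sequential I/O and small-record workloads coexist in one pool.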
ZFS S10 Content
Scalability
16 exabytes per file system
Many zettabytes per pool
Unlimited files per file system
Unlimited file systems per pool
Unlimited directory size
Unlimited number of devices
No O(data) operations