ZFS Talk Part 1


DESCRIPTION

A presentation about ZFS aimed at developers. Given at Datto, see the talk here https://www.youtube.com/watch?v=Wd6eacYeeJI


zfs to developers

zfs, a modern file system built for large scale data integrity

wikipedia to the rescue!

NFS, Lustre, OpenOffice

Sun Microsystems

They also made a large amount of hardware

http://zfsonlinux.org/docs/LUG12_ZFS_Lustre_for_Sequoia.pdf

2005, integrated into solaris kernel

2013 OpenZFS

2010 illumos founded

45 commits to ZoL 1174 commits to ZoL

2008 first commit to ZoL

- File systems should be large
- Storage media is not to be trusted
- Storage maintenance should be easy
- Disk storage should be more like ram

File systems should be large

Our largest system was 144 TB of storage.

disks * capacity = 36 * 4 TB = 144 TB

ZFS can address hard drives so large they could not be stored on this planet.

File systems should be large

ext4: 1 EiB
HFS+: 1 EiB
Btrfs: 16 EiB
ZFS: 256 × 1024 EiB

File systems should be large

Who cares?

Storage media is not to be trusted

- Spinning disks have a bit error rate
- Sometimes the head writes to the wrong place
- ”Modern hard disks write so fast and so faint that they are only guessing that what they read is what you wrote”
- Cables go bad
- Cosmic rays (!!!)

Storage media is not to be trusted

zfs overcomes these problems with checksumming. Every block is run through fletcher4 before it is written, and that checksum is combined with other metadata and written “far away” from the data when they are written out.

sha256
future: Edon-R, Skein

Storage media is not to be trusted

This does not happen too often; it is usually just a great early warning that the drive is failing.
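The checksum algorithm is a per-dataset property, and a scrub is how you ask ZFS to verify every block against its checksum. A minimal sketch, assuming a pool named tank:

```shell
# Inspect the current checksum algorithm (fletcher4 is the default)
zfs get checksum tank

# Switch new writes to sha256; already-written blocks keep their old checksums
zfs set checksum=sha256 tank

# A scrub reads every block and verifies it against its checksum,
# repairing from redundancy where possible
zpool scrub tank
zpool status tank    # shows scrub progress and any repaired errors
```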

Storage maintenance should be easy

zpool create name disks
zfs create filesystem
zfs set compression=off filesystem
zfs set sync=disabled filesystem
zpool status
zfs destroy

Storage maintenance should be easy

Is it intuitive?

zfs snapshot
zfs send/receive
zfs create/destroy

Storage maintenance should be easy

Is it intuitive?

zpool add VS zpool attach
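The two are easy to confuse. A hedged sketch of the difference, assuming a pool named tank and hypothetical disk ids:

```shell
# zpool attach: bolt a new device onto an EXISTING device, turning a
# single disk (or mirror) into a wider mirror -- this adds redundancy
zpool attach tank ata-DISK_A ata-DISK_B    # DISK_B now mirrors DISK_A

# zpool add: add a brand-new top-level vdev -- this adds capacity,
# and historically could not be undone, so double-check first
zpool add tank mirror ata-DISK_C ata-DISK_D
```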

Storage maintenance should be easy

Is it easy?

I think so

Disk storage should be more like ram

You should be able to open a computer up, throw some disks in, and be running. Never need to mess with it, never need to tune it.

Disk storage should be more like ram

FAIL

tuning is not recommended

Disk storage should be more like ram

“Tuning is evil, yes, in the way that doing something against the will of the creator is evil”

zfs sits above your hard drives and below your directory; it adds features you might like.


data integrity

transparent compression (LZ4)

improved throughput

snapshotting
replication via snapshotting

speed via ARC

easy maintenance

choice in raid setup

Command overview

zfs
zpool

zdb

Command overview

zfs: every week
zpool: every month

zdb depends on the day

Command overview

zfs: Awesome man page
zpool: Awesome man page

zdb meh...

zpool create

zpool create tank -o ashift=12 -O compression=lz4 mirror ata-WDC_WD1002FAEX-00Y9A0_WD-WCAW32714185 ata-WDC_WD1002FAEX-00Z3A0_WD-WMATR0443468

/dev/disk/by-id/ata-*


http://zfsonlinux.org/faq.html#WhatDevNamesShouldIUseWhenCreatingMyPool

zpool status

/home/sburgess > zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 19h39m with 0 errors on Tue Jul 15 10:23:16 2014
config:

	NAME                                           STATE   READ WRITE CKSUM
	tank                                           ONLINE     0     0     0
	  mirror-0                                     ONLINE     0     0     0
	    ata-WDC_WD1002FAEX-00Y9A0_WD-WCAW32714185  ONLINE     0     0     0
	    ata-WDC_WD1002FAEX-00Z3A0_WD-WMATR0443468  ONLINE     0     0     0

so far

/home/sburgess > zpool get all tank

NAME  PROPERTY  VALUE   SOURCE
tank  size      928G    -
tank  capacity  34%     -
tank  health    ONLINE  -

so far

/home/sburgess > zfs get all tank

NAME  PROPERTY          VALUE                  SOURCE
tank  type              filesystem             -
tank  creation          Thu Jan  3 15:55 2013  -
tank  used              325G                   -
tank  available         589G                   -
tank  referenced        184K                   -
tank  compressratio     1.54x                  -
tank  mounted           yes                    -
tank  recordsize        128K                   default
tank  mountpoint        /tank                  default
tank  compression       lz4                    local
tank  sync              standard               default
tank  refcompressratio  1.00x

so far

/home/sburgess > ls /tank/

zfs create

zfs create tank/home

zfs create

zfs create -o mountpoint=/home/sburgess tank/home/sburgess

zfs create

zfs create tank/home/sburgess/downloads
zfs create tank/home/sburgess/projects
zfs create tank/home/sburgess/tools


chown -R sburgess: /home/sburgess

zfs create

zfs list -o name,refer,used,compressratio -r tank/home/sburgess

NAME                          REFER  USED   RATIO
tank/home/sburgess            4.37G  114G   1.73x
tank/home/sburgess/downloads  34.8G  36.0G  1.66x
tank/home/sburgess/projects   2.08G  11.7G  1.30x
tank/home/sburgess/tools      583M   635M   1.54x

zfs create

mv Pictures pic

zfs create tank/home/sburgess/Pictures

chown -R sburgess: Pictures

mv pic/* Pictures

zfs create

/home/sburgess > zfs list -o name,refer,used,compressratio -r tank/home/sburgess
NAME                          REFER  USED   RATIO
tank/home/sburgess            4.36G  114G   1.73x
tank/home/sburgess/Pictures   11.3M  11.3M  1.16x
tank/home/sburgess/downloads  34.8G  36.0G  1.66x
tank/home/sburgess/projects   2.08G  11.7G  1.30x
tank/home/sburgess/tools      583M   635M   1.54x

zfs create

shopt -s dotglob

du -hs *

2.9G .kde

1.3G .cache

uberblock

uberblock

The root of the zfs hash tree

“A Merkle tree is a tree in which every non-leaf node is labelled with the hash of the labels of its children nodes.”

uberblock

zdb -u poolName

uberblock

zdb -u test

Uberblock:
	magic = 0000000000bab10c
	version = 5000
	txg = 5
	guid_sum = 16411893724316372364
	timestamp = 1392754246 UTC = Tue Feb 18 15:10:46 2014

uberblock

Uberblock:
	magic = 0000000000bab10c
	version = 5000
	txg = 5
	guid_sum = 16411893724316372364
	timestamp = 1392754246 UTC = Tue Feb 18 15:10:46 2014

… cat /dev/urandom > file …

Uberblock:
	magic = 0000000000bab10c
	version = 5000
	txg = 163
	guid_sum = 16411893724316372364
	timestamp = 1392755035 UTC = Tue Feb 18 15:23:55 2014

uberblock

Uberblock:
	magic = 0000000000bab10c
	version = 5000
	txg = 163
	guid_sum = 16411893724316372364
	timestamp = 1392755035 UTC = Tue Feb 18 15:23:55 2014

… zpool attach pool disk1 disk2…

Uberblock:
	magic = 0000000000bab10c
	version = 5000
	txg = 197
	guid_sum = 16865875370843337150
	timestamp = 1392755190 UTC = Tue Feb 18 15:26:30 2014

uberblock

Go back in time via

zpool import -F
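A sketch of the recovery flow, assuming a pool named tank. -F discards the last few transaction groups and imports the pool from an older uberblock:

```shell
# Dry run: with -n, report whether a rewind would make the pool
# importable, without changing anything
zpool import -F -n tank

# Actually rewind: the most recent transactions are lost,
# but the pool comes back in a consistent earlier state
zpool import -F tank
```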

snapshotting

snapshotting

zfs snapshot tank/home/sburgess@now

snapshotting

zfs list -o name,creation,used -t all -r tank/home/sburgess

What to do with snapshots

.zfs directory

Always there; whether or not it shows up in ls -a is controlled by

zfs set snapdir=hidden|visible filesystem

.zfs directory

Contains .zfs/snapshot, which has a directory for each snapshot. When you access any of those directories, the snapshot is temporarily mounted read-only there.
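For example, assuming the tank/home/sburgess filesystem from earlier with a snapshot named now (the file name is hypothetical):

```shell
# The control directory exists even when hidden from ls -a
ls /home/sburgess/.zfs/snapshot/

# Each snapshot is browsable as a read-only directory tree
diff /home/sburgess/.zfs/snapshot/now/notes.txt /home/sburgess/notes.txt

# Restoring one file is just a copy
cp /home/sburgess/.zfs/snapshot/now/notes.txt /home/sburgess/
```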

.zfs directory

Use case:

-Test if/when a file was created

-Easily restore a file or two, for large complicated restores, use clone.

zfs rollback

zfs rollback tank/home/sburgess@then

Should be the most recent snapshot, but you can use -r to roll back further
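A sketch with hypothetical snapshot names:

```shell
# Roll back to the most recent snapshot
zfs rollback tank/home/sburgess@then

# Rolling back past intervening snapshots needs -r,
# which destroys the newer snapshots on the way
zfs rollback -r tank/home/sburgess@earlier
```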

zfs rollback

Use case:

Being too bold with tar -x

zfs clone

zfs clone tank/home/sburgess@now tank/other

tank/other is a read/write, snapshottable, cloneable file system

Initially it shares all blocks with the parent, takes zero space, and amplifies ARC hits.

zfs clone

Use case:

Virtual Machine base images

All configs, modules, programs and OS data shared

zfs clone

zfs clone -o readonly=on -o mountpoint=/tmp/ro tank/home/sburgess@now tank/other

zfs clone

- safe (readonly)
- 0 time
- 0 space

zfs clone

Use case:

- large file restore
- diffing files across both

zfs clone

What clones of this snapshot exist?
zfs get clones filesystem@snapshot

What snapshot was this filesystem cloned from?
zfs get origin filesystem

a note on -

“-” is ZFS for none/null/not applicable

zfs get clones tank
NAME  PROPERTY  VALUE  SOURCE
tank  clones    -      -

zfs get origin tank@now
NAME      PROPERTY  VALUE  SOURCE
tank@now  origin    -      -

a note on -

“-” is ZFS for none/null/not applicable

zpool get version

NAME  PROPERTY  VALUE  SOURCE
tank  version   -      default

a note on 5000

zpool version numbers no longer increase with features

zfs send

zfs send

Original idea:

Send the changes I made today across the ocean

zfs send

Create a file detailing the changes that need to be made to transition a filesystem from one snapshot to another.

zfs send

zfs send is a dictation, not a conversation

zfs send

zpool create -O compression=off -O copies=2 -o ashift=12

zpool create -O compression=lz4 -O checksum=sha256 -o ashift=9

zfs send

zfs send tank/currr@1387825261
Error: Stream can not be written to a terminal.
You must redirect standard output.

zfs send

-n

-v

zfs send

zfs send -n -v tank/home/sburgess@now

zfs send

zfs send -n -v tank/home/sburgess@now
send from @ to tank/home/sburgess@now
total estimated size is 9.22G

zfs send

zfs send tank/home/sburgess@now

What does this send? What does it create when it's received?

zfs send

zfs send tank/home/sburgess@now

It sends a “full” filesystem, everything that is needed to create tank/home/sburgess@now

The receiving side gets a new FS with a single snapshot named now

zfs send

Can be used with the -i and -I options to send incremental changes. Only send the blocks that changed between the first and second snapshots.
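A sketch of full versus incremental sends, with hypothetical dataset, snapshot, and host names; the receiver of an incremental stream must already have the base snapshot:

```shell
# Full stream: everything needed to reconstruct @monday on the far side
zfs send tank/data@monday | ssh backup zfs receive pool/data

# Incremental stream: only the blocks that changed between
# @monday and @tuesday
zfs send -i @monday tank/data@tuesday | ssh backup zfs receive pool/data
```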

zfs send

-i do not send intermediate snapshots

-I send intermediate snapshots


zfs send -I early file/system/path@late

zfs get vs zfs list

zfs get vs zfs list

When working interactively use zfs list

zfs list -t all -o name,written,used,mounted

NAME                                 WRITTEN  USED   MOUNTED
tank/home/sburgess/tools@1387825261  0        0      -
tank/images                          590M     8.82G  no
tank/images@base                     8.25G    369M   -
tank/other                           8K       8K     yes
tank/trick                           0        136K   yes

zfs get vs zfs list

zfs list is the same as

zfs list -o name,used,avail,refer,mountpoint

zfs get vs zfs list

zfs list is the same as

zfs list -o name,used,avail,refer,mountpoint ^^^^

zfs get vs zfs list

zfs list | grep/awk/??

zfs get vs zfs list

when looking at an FS or snapshot, I call

zfs get all item | less

zfs get vs zfs list

For programmatic use, use zfs get -H -p

zfs get used tank

NAME  PROPERTY  VALUE  SOURCE
tank  used      484G   -

zfs get used -o value -H -p tank

519265562624
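The tab-separated -H -p output is easy to script against. A minimal sketch using a line like the one above, captured into a variable so the parsing itself needs no live pool:

```shell
# One line of `zfs get used -H -p tank` output:
# name, property, value, source -- tab-separated
line=$(printf 'tank\tused\t519265562624\t-')

# Field 3 is the raw byte count (cut splits on tabs by default)
bytes=$(printf '%s\n' "$line" | cut -f3)

# Integer GiB via shell arithmetic (truncates, so slightly below
# the rounded value `zfs list` displays)
gib=$(( bytes / 1024 / 1024 / 1024 ))
echo "$bytes bytes = ${gib} GiB"
```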

Learn more

read the zpool man page

read the zfs man page

subscribe to the ZoL mailing list, and just read new messages as they come in