Ceph Day Santa Clara: The Future of CephFS + Developing with Librados


Future of CephFS

Sage Weil

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift


Finally, let's talk about Ceph FS. Ceph FS is a parallel filesystem that provides a massively scalable, single-hierarchy, shared disk. If you use a shared drive at work, this is the same thing, except that the same drive could be shared by everyone you've ever met (and everyone they've ever met).

[Diagram: a Ceph FS client exchanges metadata with the metadata servers, while file data flows directly between the client and RADOS]

Remember all that metadata we talked about at the beginning? Feels so long ago. It has to be stored somewhere! Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Enter the MDS, the Ceph Metadata Server. Clients accessing Ceph FS data first make a request to an MDS, which provides what they need to get files from the right OSDs.


There are multiple MDSs!

Metadata Server

Manages metadata for a POSIX-compliant shared filesystem

Directory hierarchy

File metadata (owner, timestamps, mode, etc.)

Stores metadata in RADOS

Does not serve file data to clients

Only required for shared filesystem

If you aren't running Ceph FS, you don't need to deploy metadata servers.

legacy metadata storage

a scaling disaster: name → inode → block list → data

no inode table locality

fragmentation: inode table, directory

many seeks

difficult to partition

[Diagram: a conventional directory tree (usr, etc, var, home, lib, include, bin, vmlinuz, passwd, mtab, hosts) with its inodes scattered across a global inode table]

ceph fs metadata storage

block lists unnecessary

inode table mostly useless: APIs are path-based, not inode-based

no random table access, sloppy caching

embed inodes inside directories: good locality, prefetching

leverage key/value objects (see the sketch below)

[Diagram: per-directory objects (usr, etc, var, home, ...) embedding the inodes of their entries]
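To make that last point concrete, here is a hypothetical python-rados sketch of the idea: a directory stored as one RADOS object whose key/value (omap) bundle holds one key per entry, so a dentry and its embedded inode can be read or updated without touching a global inode table. The pool name, object name, and inode encoding are all invented for illustration; the real MDS uses its own on-disk formats.

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('metadata')        # hypothetical pool name

# one RADOS object per directory; each entry name maps to its embedded
# inode, so readdir+stat touches one object instead of an inode table
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ('passwd', 'hosts', 'mtab'),
                   (b'ino=100,mode=0644', b'ino=102,mode=0644', b'ino=103,mode=0644'))
    ioctx.operate_write_op(op, 'dir_etc')     # invented object name for /etc

ioctx.close()
cluster.shutdown()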

controlling metadata io

view ceph-mds as a cache

reduce reads: dir+inode prefetching

reduce writes: consolidate multiple writes

large journal or log, striped over objects

two tiers: journal for short term, per-directory storage for long term

fast failure recovery

one tree

three metadata servers

So how do you have one tree and multiple servers?

load distribution

coarse (static subtree): preserves locality, but high management overhead

fine (hash): always balanced and less vulnerable to hot spots, but destroys hierarchy and locality

can a dynamic approach capture benefits of both extremes?

[Diagram: a partitioning spectrum from static subtree (good locality) through hashing directories to hashing files (good balance)]

If there's just one MDS (which is a terrible idea), it manages metadata for the entire tree.

When the second one comes along, it will intelligently partition the work by taking a subtree.

When the third MDS arrives, it will attempt to split the tree again.

Same with the fourth.

DYNAMIC SUBTREE PARTITIONING

An MDS can even take just a single directory or file, if the load is high enough. This all happens dynamically, based on load and the structure of the data, and it's called dynamic subtree partitioning.

scalable: arbitrarily partition metadata

adaptive: move work from busy to idle servers

replicate hot metadata

efficient: hierarchical partition preserves locality

dynamic: daemons can join/leave

take over for failed nodes

dynamic subtree partitioning

Dynamic partitioning: across many directories, or within the same directory

Failure recovery

Metadata replication and availability

Metadata cluster scaling

client protocol

highly stateful: consistent, fine-grained caching

seamless hand-off between ceph-mds daemons: when a client traverses the hierarchy

when metadata is migrated between servers

direct access to OSDs for file I/O

an example (RT = round trip)

mount -t ceph 1.2.3.4:/ /mnt
3 ceph-mon RT
2 ceph-mds RT (1 ceph-mds-to-osd RT)

cd /mnt/foo/bar
2 ceph-mds RT (2 ceph-mds-to-osd RT)

ls -al
open, readdir: 1 ceph-mds RT (1 ceph-mds-to-osd RT)
stat each file
close

cp * /tmp
N ceph-osd RT

recursive accounting

ceph-mds tracks recursive directory stats: file sizes

file and directory counts

modification time

virtual xattrs present the full stats (see the sketch after the listing)

efficient

$ ls -alSh | head
total 0
drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
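Those virtual xattrs can be read with any xattr interface. A minimal sketch in Python on a Linux client, assuming a CephFS mount with a hypothetical directory /mnt/foo:

import os

# recursive stats exposed by CephFS as virtual extended attributes
path = '/mnt/foo'                                   # hypothetical CephFS directory
rbytes = int(os.getxattr(path, 'ceph.dir.rbytes'))  # total bytes beneath this directory
rfiles = int(os.getxattr(path, 'ceph.dir.rfiles'))  # total files beneath this directory
print(f'{path}: {rfiles} files, {rbytes} bytes (recursive)')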

snapshots

volume or subvolume snapshots are unusable at petabyte scale: snapshot arbitrary subdirectories instead

simple interface: hidden '.snap' directory

no special tools

$ mkdir foo/.snap/one            # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776               # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile  bar/
$ rmdir foo/.snap/one            # remove snapshot

multiple client implementations

Linux kernel client: mount -t ceph 1.2.3.4:/ /mnt

re-export via NFS or Samba (CIFS)

ceph-fuse

libcephfs.so linked into your app

Samba (CIFS)

Ganesha (NFS)

Hadoop (map/reduce)

[Diagram: client stacks over RADOS: the kernel client; ceph-fuse; your app on libcephfs; and Samba (SMB/CIFS), Ganesha (NFS), and Hadoop, each on libcephfs]
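Since libcephfs.so can be linked directly into an application, here is a hedged sketch using the python-cephfs bindings (a thin wrapper over libcephfs); exact signatures vary across releases, so treat them as approximate:

import cephfs

fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount()                                   # talk to the MDS/OSDs; no kernel mount involved

fd = fs.open('/hello.txt', 'w', 0o644)       # create a file at the filesystem root
fs.write(fd, b'written via libcephfs\n', 0)  # write at offset 0
fs.close(fd)

fd = fs.open('/hello.txt', 'r', 0)
print(fs.read(fd, 0, 100))                   # read back up to 100 bytes
fs.close(fd)

fs.shutdown()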

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift


CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOS: AWESOME
LIBRADOS: AWESOME
RBD: AWESOME
RADOSGW: AWESOME
CEPH FS: NEARLY AWESOME

Ceph FS is feature-complete, but it still lacks the testing, quality assurance, and benchmarking work we feel is needed before we can recommend it for production use.

Path forward

Testing: various workloads; multiple active MDSs

Test automation: simple workload generator scripts; bug reproducers

Hacking: bug squashing; long-tail features

Integrations: Ganesha, Samba, *stacks




librados

object model

pools: 1s to 100s

independent namespaces or object collections

replication level, placement policy

objects: bazillions

blob of data (bytes to gigabytes)

attributes (e.g., version=12; bytes to kilobytes)

key/value bundle (bytes to gigabytes)
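A minimal sketch of this model through the python-rados bindings, touching each piece: a pool (opened as an I/O context), an object's byte payload, an attribute, and its key/value (omap) bundle. The pool name 'data' and object name 'greeting' are placeholders:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

ioctx = cluster.open_ioctx('data')               # a pool: an independent object namespace

ioctx.write_full('greeting', b'hello world')     # blob of byte data
ioctx.set_xattr('greeting', 'version', b'12')    # an attribute

with rados.WriteOpCtx() as op:                   # the key/value (omap) bundle
    ioctx.set_omap(op, ('lang',), (b'en',))
    ioctx.operate_write_op(op, 'greeting')

ioctx.close()
cluster.shutdown()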

atomic transactions

client operations are sent to the OSD cluster and operate on a single object

can contain a sequence of operations, e.g.:

truncate object

write new object data

set attribute

atomicity: all operations commit or do not commit atomically

conditional: 'guard' operations can control whether the operation is performed

verify xattr has a specific value

assert object is a specific version

allows atomic compare-and-swap etc.
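A sketch of a compound operation via python-rados, assuming a recent release whose WriteOp objects expose truncate()/write_full() (older bindings only expose the omap helpers); all steps land atomically or not at all:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')                    # placeholder pool name

with rados.WriteOpCtx() as op:
    op.truncate(0)                                    # truncate object
    op.write_full(b'replacement contents')            # write new object data
    ioctx.set_omap(op, ('state',), (b'clean',))       # set a key/value pair alongside
    ioctx.operate_write_op(op, 'greeting')            # commits as one transaction

ioctx.close()
cluster.shutdown()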

key/value storage

store key/value pairs in an object: independent from object attrs or byte data payload

based on google's leveldb: efficient random and range insert/query/removal

(leveldb is in turn based on BigTable's SSTable design)

exposed via a key/value API: insert, update, remove

individual keys or ranges of keys

avoids the read/modify/write cycle when updating complex objects, e.g. file system directory objects
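For example, a range query with python-rados: fetch up to 100 key/value pairs from an object's omap without reading its byte payload. The object name 'dir_etc' reuses the hypothetical directory object sketched earlier:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('metadata')                 # hypothetical pool

with rados.ReadOpCtx() as op:
    # start after "" (the beginning), no key-prefix filter, max 100 entries
    it, ret = ioctx.get_omap_vals(op, "", "", 100)
    ioctx.operate_read_op(op, 'dir_etc')
    for name, value in it:                             # keys come back in sorted order
        print(name, value)

ioctx.close()
cluster.shutdown()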

watch/notify

establish a stateful 'watch' on an object: client interest is persistently registered with the object

client keeps a session to the OSD open

send 'notify' messages to all watchers: the notify message (and payload) is distributed to all watchers

variable timeout

notification on completion: all watchers got and acknowledged the notify

use any object as a communication/synchronization channel: locking, distributed coordination (a la ZooKeeper), etc.
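A sketch of the pattern with python-rados; the watch/notify bindings only appeared in newer releases, so treat the call names and callback signature as assumptions:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')

def on_notify(notify_id, notifier_id, watch_id, data):
    # invoked on each watcher when a notify arrives on the object
    print('notify payload:', data)

watch = ioctx.watch('channel', on_notify)      # register persistent interest

# from any client: fan out to all watchers; returns after they acknowledge
ioctx.notify('channel', 'invalidate-cache')

watch.close()
cluster.shutdown()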

[Sequence diagram: clients #1-#3 each establish a watch on an object at the OSD, receiving ack/commit; one client sends a notify, which the OSD fans out to all watchers; each watcher acks, and the notifying client receives 'complete']

watch/notify example

radosgw cache consistency: radosgw instances watch a single object (.rgw/notify)

locally cache bucket metadata

on bucket metadata changes (removal, ACL changes): write the change to the relevant bucket object

send a notify with the bucket name to the other radosgw instances

on receipt of notify: invalidate the relevant portion of the cache

rados classes

dynamically loaded .so files: /var/lib/rados-classes/*

implement new object methods using existing methods

part of the I/O pipeline

simple internal API

reads: can call existing native or class methods

do whatever processing is appropriate

return data

writes: can call existing native or class methods

do whatever processing is appropriate

generates a resulting transaction to be applied atomically
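The classes themselves are written against the in-OSD C/C++ API, but invoking one from a client is simple. A sketch with python-rados' Ioctx.execute(), calling the 'hello' example class that ships with Ceph (its say_hello method echoes a greeting); the pool and object names are placeholders:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')

ioctx.write_full('obj', b'')                    # class methods run against an object
ret, out = ioctx.execute('obj', 'hello', 'say_hello', b'world')
print(out)                                      # e.g. b'Hello, world!'

ioctx.close()
cluster.shutdown()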

class examples

grep: read an object, filter out individual records, and return those

sha1: read an object, generate a fingerprint, and return that

images: rotate, resize, or crop an image stored in an object

remove red-eye

crypto: encrypt/decrypt object data with a provided key

ideas

distributed key/value table: aggregate many k/v objects into one big 'table'

working prototype exists (thanks, Eleanor!)

ideas

lua rados class: embed a lua interpreter in a rados class

ship semi-arbitrary code for operations

json class: parse and manipulate json structures

ideas

rados mailbox (RMB?): plug a librados backend into dovecot, postfix, etc.

key/value object for each mailbox: key = message id

value = headers

object for each message or attachment

watch/notify for delivery notification

hard links?

rare

useful locality properties: intra-directory

parallel inter-directory

on a miss, file objects provide per-file backpointers

degenerates to log(n) lookups

optimistic read complexity
